Abstract: Sparse linear regression - finding an unknown 
vector from linear measurements - is now known to be possible 
with fewer samples than variables, via methods like the LASSO. 
We consider the multiple sparse linear regression problem, where 
several related vectors - with partially shared support sets - have 
to be recovered. A natural question in this setting is whether one 
can use the sharing to further decrease the overall number of 
samples required. A line of recent research has studied the use of 
$\ell_1/\ell_q$ norm block-regularizations with $q > 1$ for such problems; 
however these could actually perform worse in sample complexity 
- vis a vis solving each problem separately ignoring sharing - 
depending on the level of sharing. 

We present a new method for multiple sparse linear regression 
that can leverage support and parameter overlap when it exists, 
but not pay a penalty when it does not. It rests on a very simple idea: we 
decompose the parameters into two components and regularize 
these differently. We show, both theoretically and empirically, that our 
method strictly and noticeably outperforms both $\ell_1$ and $\ell_1/\ell_q$ 
methods, over the entire range of possible overlaps (except at 
boundary cases, where we match the best method). We also 
provide theoretical guarantees that the method performs well 
under high-dimensional scaling. 

Index Terms: Multi-task Learning, High-dimensional Statistics, 
Multiple Regression. 

I. Introduction: Motivation and Setup 

High-dimensional scaling. In fields across science and engi- 
neering, we are increasingly faced with problems where the 
number of variables or features p is larger than the number of 
observations n. Under such high-dimensional scaling, for any 
hope of statistically consistent estimation, it becomes vital to 
leverage any potential structure in the problem such as sparsity 
(e.g. in compressed sensing and LASSO), low-rank 
structure [16, 12], or sparse graphical model structure. It 
is in such high-dimensional contexts in particular that multi-task 
learning [4] could be most useful. Here, multiple tasks 
share some common structure such as sparsity, and estimating 
these tasks jointly by leveraging this common structure could 
be more statistically efficient. 

Block-sparse Multiple Regression. A common multiple task 
learning setting, and which is the focus of this paper, is that of 
multiple regression, where we have r > 1 response variables, 
and a common set of p features or covariates. The r tasks could 
share certain aspects of their underlying distributions, such as 
common variance, but the setting we focus on in this paper 
is where the response variables have simultaneously sparse 
structure: the index set of relevant features for each task is 
sparse; and there is a large overlap of these relevant features 
across the different regression problems. Such "simultaneous 
sparsity" arises in a variety of contexts [18]; indeed, most 
applications of sparse signal recovery, in contexts ranging 
from graphical model learning and kernel learning to function 
estimation, have natural extensions to the simultaneous-sparse 
setting. 

It is useful to represent the multiple regression parameters 
via a matrix, where each column corresponds to a task, and 
each row to a feature. Having simultaneous sparse structure 
then corresponds to the matrix being largely "block-sparse" - 
where each row is either all zero or mostly non-zero, and the 
number of non-zero rows is small. A lot of recent research 
in this setting has focused on $\ell_1/\ell_q$ norm regularizations, for 
$q > 1$, that encourage the parameter matrix to have such block-sparse 
structure. Particular examples include results using the 
$\ell_1/\ell_\infty$ norm [11], and the $\ell_1/\ell_2$ norm [10, 13]. 

Our Model. Block-regularization is "heavy-handed" in two 
ways. By strictly encouraging shared-sparsity, it assumes that 
all relevant features are shared, and hence suffers under 
settings, arguably more realistic, where each task depends on 
features specific to itself in addition to the ones that are common. 
The second concern with such block-sparse regularizers 
is that the $\ell_1/\ell_q$ norms can be shown to encourage the entries 
in the non-sparse rows to take nearly identical values. Thus we 
are far away from the original goal of multitask learning: not 
only does the set of relevant features have to be exactly the same, 
but their values have to be as well. Indeed, recent research into 
such regularized methods [11, 13] cautions against the use of 
block-regularization in regimes where the supports and values 
of the parameters for each task can vary widely. Since the 
true parameter values are unknown, that would be a worrisome 
caveat. 

We thus ask the question: can we learn multiple regression 
models by leveraging whatever overlap of features exists, 
and without requiring the parameter values to be nearly identical? 
Indeed, this is an instance of a more general question 
on whether we can estimate statistical models where the data 
may not fall cleanly into any one structural bracket (sparse, 
block-sparse and so on). With the explosion of complex and 
dirty high-dimensional data in modern settings, it is vital to 
investigate estimation of corresponding dirty models, which 
might require new approaches to biased high-dimensional 
estimation. In this paper we take a first step, focusing on such 
dirty models for a specific problem: simultaneously sparse 
multiple regression. 

Our approach uses a simple idea: while any one structure 
might not capture the data, a superposition of structural classes 
might. Our method thus searches for a parameter matrix that 
can be decomposed into a row-sparse matrix (corresponding 
to the overlapping or shared features) and an elementwise 
sparse matrix (corresponding to the non-shared features). As 
we show both theoretically and empirically, with this simple 
fix we are able to leverage any extent of shared features, while 
allowing disparities in support and values of the parameters, so 
that we are always better than both the Lasso and block-sparse 
regularizers (at times remarkably so). 

The rest of the paper is organized as follows. In Section II, 
the basic definitions and setup of the problem are presented. The main 
results of the paper are discussed in Section III. Experimental results 
and simulations are presented in Section IV. 

Notation: For any matrix $M$, we denote its $j$-th row as 
$m_j$, and its $k$-th column as $m^{(k)}$. The set of all non-zero 
rows (i.e. all rows with at least one non-zero element) is 
denoted by RowSupp($M$) and its support by Supp($M$). Also, 
for any matrix $M$, let $\|M\|_{1,1} := \sum_{j,k} |m_j^{(k)}|$, i.e. the sum of 
absolute values of the elements, and $\|M\|_{1,\infty} := \sum_j \|m_j\|_\infty$, 
where $\|m_j\|_\infty := \max_k |m_j^{(k)}|$. 

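
As an illustration of this notation, the following is a minimal NumPy sketch of the two supports and the two matrix norms. The function names (`row_support`, `support`, `norm_1_1`, `norm_1_inf`) and the tolerance argument `tol` are our own choices for illustration and are not part of the paper.

```python
import numpy as np

def row_support(M, tol=0.0):
    """Indices of rows with at least one entry of magnitude above tol."""
    return np.flatnonzero(np.max(np.abs(M), axis=1) > tol)

def support(M, tol=0.0):
    """Set of (row, column) index pairs of the non-zero entries."""
    rows, cols = np.nonzero(np.abs(M) > tol)
    return set(zip(rows.tolist(), cols.tolist()))

def norm_1_1(M):
    """||M||_{1,1}: sum of absolute values of all entries."""
    return float(np.sum(np.abs(M)))

def norm_1_inf(M):
    """||M||_{1,inf}: sum over rows of the largest absolute entry per row."""
    return float(np.sum(np.max(np.abs(M), axis=1)))

# Small example: rows are features, columns are tasks.
M = np.array([[1.0, -2.0],
              [0.0,  0.0],
              [0.0,  3.0]])
print(row_support(M), sorted(support(M)), norm_1_1(M), norm_1_inf(M))
```

Note that the $\ell_1/\ell_\infty$ norm charges a row only for its largest entry, which is why it favors rows that are either entirely zero or entirely filled.
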
II. Problem Set-up and Our Method 

Multiple regression. We consider the following standard multiple 
linear regression model: 

$$ y^{(k)} = X^{(k)} \bar\theta^{(k)} + w^{(k)}, \qquad k = 1, \ldots, r, $$

where $y^{(k)} \in \mathbb{R}^n$ is the response for the $k$-th task, regressed 
on the design matrix $X^{(k)} \in \mathbb{R}^{n \times p}$ (possibly different across 
tasks), while $w^{(k)} \in \mathbb{R}^n$ is the noise vector. We assume each entry of 
$w^{(k)}$ is drawn independently from $\mathcal{N}(0, \sigma^2)$. The total number 
of tasks or target variables is $r$, the number of features is 
$p$, while the number of samples we have for each task is $n$. 
For notational convenience, we collate these quantities into 
matrices $Y \in \mathbb{R}^{n \times r}$ for the responses, $\bar\Theta \in \mathbb{R}^{p \times r}$ for the 
regression parameters and $W \in \mathbb{R}^{n \times r}$ for the noise. 

Our Model. In this paper we are interested in estimating 
the true parameter $\bar\Theta$ from data $\{y^{(k)}, X^{(k)}\}$ by leveraging 
any (unknown) extent of simultaneous-sparsity. In particular, 
certain rows of $\bar\Theta$ would have many non-zero entries, 
corresponding to features shared by several tasks ("shared" 
rows), while certain rows would be elementwise sparse, 
corresponding to those features which are relevant for some 
tasks but not all ("non-shared" rows), while certain rows 
would have all zero entries, corresponding to those features 
that are not relevant to any task. We are interested in 
estimators $\hat\Theta$ that automatically adapt to different levels of 
sharedness, and yet enjoy the following guarantees: 

Support recovery: We say an estimator $\hat\Theta$ 
successfully recovers the true signed support if 
$\mathrm{sign}(\mathrm{Supp}(\hat\Theta)) = \mathrm{sign}(\mathrm{Supp}(\bar\Theta))$. We are interested in 
deriving sufficient conditions under which the estimator 
succeeds. We note that this is stronger than merely recovering 
the row-support of $\bar\Theta$, which is the union of its supports for the 
different tasks. In particular, we denote by $U_k$ the support of 
the $k$-th column of $\bar\Theta$, and $U = \bigcup_k U_k$. 

Error bounds: We are also interested in providing bounds 
on the elementwise $\ell_\infty$ norm error of the estimator $\hat\Theta$, 
$\|\hat\Theta - \bar\Theta\|_\infty = \max_{j,k} |\hat\theta_j^{(k)} - \bar\theta_j^{(k)}|$. 

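
Both guarantees are easy to check numerically once an estimate is available. The sketch below is one possible way to do so; the helper names and the tolerance `tol` (used to treat tiny estimated entries as zero) are assumptions made for illustration only.

```python
import numpy as np

def signed_support(theta, tol=0.0):
    """Matrix of -1/0/+1 giving the sign of every entry above tol in magnitude."""
    signs = np.sign(theta)
    signs[np.abs(theta) <= tol] = 0
    return signs.astype(int)

def recovers_signed_support(theta_hat, theta_bar, tol=0.0):
    """True iff the estimate has exactly the signed support of the truth."""
    return bool(np.array_equal(signed_support(theta_hat, tol),
                               signed_support(theta_bar)))

def elementwise_max_error(theta_hat, theta_bar):
    """Elementwise l_infinity error: largest absolute entrywise deviation."""
    return float(np.max(np.abs(theta_hat - theta_bar)))

theta_bar = np.array([[1.0, 0.0], [0.0, -2.0]])
theta_hat = np.array([[0.9, 0.0], [0.0, -2.2]])
print(recovers_signed_support(theta_hat, theta_bar),
      elementwise_max_error(theta_hat, theta_bar))
```

A single entry with the wrong sign, or a single spurious non-zero entry, makes `recovers_signed_support` return `False`, which mirrors the strictness of the criterion in the text.
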
A. Our Method 

Our method models the unknown parameter $\bar\Theta$ as a superposition 
of a block-sparse matrix $B$ (corresponding to 
the features shared across many tasks) and a sparse matrix 
$S$ (corresponding to the features shared across few tasks). 
We estimate the sum of two parameter matrices $B$ and $S$ 
with different regularizations for each: encouraging block-structured 
row-sparsity in $B$ and elementwise sparsity in $S$. 
The corresponding simple models would either just use block-sparse 
regularizations [11, 13] or just elementwise sparsity 
regularizations [17, 21], so that either method would perform 
better in certain suited regimes. Interestingly, as we 
will see in the main results, by explicitly allowing both 
block-sparse and elementwise sparse components (see 
Algorithm 1), we are able to outperform both classes of 
these "clean models", for all regimes of $\bar\Theta$. 

III. Main Results and Their Consequences 

We now provide precise statements of our main results. A 
number of recent results have shown that the Lasso [17, 21] 
and $\ell_1/\ell_\infty$ block-regularization [11] methods succeed in 
model selection, i.e., recovering signed supports with controlled 
error bounds under high-dimensional scaling regimes. 
Our first two theorems extend these results to our model 
setting. In Theorem 1, we consider the case of deterministic 
design matrices $X^{(k)}$, and provide sufficient conditions guaranteeing 
signed support recovery, and elementwise $\ell_\infty$ norm 
error bounds. In Theorem 2, we specialize this theorem to 
the case where the rows of the design matrices are random 
from a general zero mean Gaussian distribution: this allows 
us to provide scaling on the number of observations required 
in order to guarantee signed support recovery and bounded 
elementwise $\ell_\infty$ norm error. 

Our third result is the most interesting in that it explicitly 
quantifies the performance gains of our method vis-a-vis Lasso 
and the $\ell_1/\ell_\infty$ block-regularization method. Since this entailed 
finding the precise constants underlying earlier theorems, and 
a correspondingly more delicate analysis, we follow Negahban 
and Wainwright [11] and focus on the case where there are 
two tasks (i.e. $r = 2$), and where we have standard Gaussian 
design matrices as in Theorem 2. Further, while each of the two 
tasks depends on $s$ features, only a fraction $\alpha$ of these are 
common. It is then interesting to see how the behaviors of 
the different regularization methods vary with the extent of 
overlap $\alpha$. 

Comparisons. Negahban and Wainwright [11] show that there 
is actually a "phase transition" in the scaling of the probability 
of successful signed support-recovery with the number of 
observations. Denote a particular rescaling of the sample size 
$\theta_{\mathrm{Lasso}}(n, p, \alpha) = \frac{n}{s \log(p - s)}$. Then, as Wainwright [21] shows, 
when the rescaled number of samples scales as $\theta_{\mathrm{Lasso}} > 2 + \delta$ 
for any $\delta > 0$, Lasso succeeds in recovering the signed 
support of all columns with probability converging to one. 
But when the sample size scales as $\theta_{\mathrm{Lasso}} < 2 - \delta$ for any 
$\delta > 0$, Lasso fails with probability converging to one. For the 
$\ell_1/\ell_\infty$-regularized multiple linear regression, define a similar 
rescaled sample size $\theta_{1,\infty}(n, p, \alpha) = \frac{n}{s \log(p - (2 - \alpha)s)}$. Then, as 
Negahban and Wainwright [11] show, there is again a transition 
in probability of success from near zero to near one, at the 
rescaled sample size of $\theta_{1,\infty} = (4 - 3\alpha)$. Thus, for $\alpha < 2/3$ 
("less sharing") Lasso would perform better since its transition 
is at a smaller sample size, while for $\alpha > 2/3$ ("more sharing") 
the $\ell_1/\ell_\infty$ regularized method would perform better. 

Algorithm 1 Complex Block Sparse 

Solve the following convex optimization problem: 

$$ (\hat S, \hat B) \in \arg\min_{S, B} \; \frac{1}{2n} \sum_{k=1}^{r} \left\| y^{(k)} - X^{(k)} \left( s^{(k)} + b^{(k)} \right) \right\|_2^2 + \lambda_s \|S\|_{1,1} + \lambda_b \|B\|_{1,\infty}. \qquad (1) $$

Then output $\hat\Theta = \hat B + \hat S$. 
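
The convex program (1) of Algorithm 1 can be prototyped in a few lines. The sketch below is not the coordinate descent solver used in the experiments; it swaps in plain proximal gradient descent, which applies because the penalty separates over $S$ and $B$. The proximal map of the row-wise $\ell_\infty$ penalty is obtained from a Euclidean projection onto an $\ell_1$ ball (Moreau decomposition). All function names, the fixed iteration count, and the step-size choice are assumptions made for illustration.

```python
import numpy as np

def soft_threshold(V, t):
    """Prox of t * ||.||_{1,1}: shrink every entry toward zero by t."""
    return np.sign(V) * np.maximum(np.abs(V) - t, 0.0)

def project_l1_ball(v, radius):
    """Euclidean projection of vector v onto {u : ||u||_1 <= radius}."""
    a = np.abs(v)
    if radius <= 0:
        return np.zeros_like(v)
    if a.sum() <= radius:
        return v.copy()
    u = np.sort(a)[::-1]
    css = np.cumsum(u)
    ks = np.arange(1, u.size + 1)
    rho = np.flatnonzero(u - (css - radius) / ks > 0)[-1]
    tau = (css[rho] - radius) / (rho + 1)
    return np.sign(v) * np.maximum(a - tau, 0.0)

def prox_row_linf(V, t):
    """Prox of t * ||.||_{1,inf}, row by row: v - proj_{l1 ball of radius t}(v)."""
    return np.vstack([row - project_l1_ball(row, t) for row in V])

def objective(Xs, Y, S, B, lam_s, lam_b):
    """Value of program (1) for design matrices Xs[k] and responses Y[:, k]."""
    n = Y.shape[0]
    fit = sum(np.sum((Y[:, k] - Xs[k] @ (S[:, k] + B[:, k])) ** 2)
              for k in range(len(Xs)))
    return (fit / (2 * n) + lam_s * np.abs(S).sum()
            + lam_b * np.abs(B).max(axis=1).sum())

def dirty_model_prox_grad(Xs, Y, lam_s, lam_b, n_iter=500):
    """Proximal gradient sketch for (1); returns the pair (S, B)."""
    n, r = Y.shape
    p = Xs[0].shape[1]
    S, B = np.zeros((p, r)), np.zeros((p, r))
    # The smooth term has a 2 * max_k ||X_k||_2^2 / n Lipschitz gradient
    # in the stacked variable (S, B), since S and B enter only via S + B.
    step = n / (2.0 * max(np.linalg.norm(X, 2) ** 2 for X in Xs))
    for _ in range(n_iter):
        G = np.column_stack([
            -Xs[k].T @ (Y[:, k] - Xs[k] @ (S[:, k] + B[:, k])) / n
            for k in range(r)])
        S = soft_threshold(S - step * G, step * lam_s)
        B = prox_row_linf(B - step * G, step * lam_b)
    return S, B
```

The estimate is then `Theta_hat = B + S`. With a step size equal to the inverse Lipschitz constant, each iteration does not increase the objective, so the sketch is safe to run for a fixed number of iterations, although it is slower than a tuned coordinate descent.
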

As we show in our third theorem, the phase transition for 
our method occurs at the rescaled sample size of $\theta_{1,\infty} = (2 - \alpha)$, 
which is strictly before either the Lasso or the $\ell_1/\ell_\infty$ 
regularized method except for the boundary cases: $\alpha = 0$, 
i.e. the case of no sharing, where we match Lasso, and 
$\alpha = 1$, i.e. full sharing, where we match $\ell_1/\ell_\infty$. Everywhere 
else, we strictly outperform both methods. Figure shows the 
empirical performance of each of the three methods; as can 
be seen, they agree very well with the theoretical analysis. 
(Further details are in the experiments, Section IV.) 
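
The comparison above can be tabulated directly from the three threshold constants quoted in the text (2 for Lasso, $4 - 3\alpha$ for $\ell_1/\ell_\infty$, and $2 - \alpha$ for our method). The short sketch below does only that; the function name and dictionary keys are illustrative, and the small difference between the logarithmic factors of the rescalings is ignored.

```python
def phase_transition_thresholds(alpha):
    """Rescaled sample size at which each method starts to succeed."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return {
        "lasso": 2.0,                 # independent of the overlap
        "block": 4.0 - 3.0 * alpha,   # l1/l_infinity block regularization
        "dirty": 2.0 - alpha,         # superposition of B and S
    }

for alpha in (0.0, 0.3, 2.0 / 3.0, 0.8, 1.0):
    t = phase_transition_thresholds(alpha)
    best_clean = min(t["lasso"], t["block"])
    saving = 1.0 - t["dirty"] / best_clean
    print(f"alpha={alpha:.2f}  lasso={t['lasso']:.2f}  "
          f"block={t['block']:.2f}  dirty={t['dirty']:.2f}  "
          f"saving={saving:.0%}")
```

At $\alpha = 2/3$ the two clean methods coincide at 2 while the dirty model needs $4/3$, i.e. one third fewer samples, which is the largest relative gain over the better of the two clean methods.
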

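A. Sufficient Conditions for Deterministic Designs 

We first consider the case where the design matrices $X^{(k)}$ 
for $k = 1, \ldots, r$ are deterministic, and start by specifying 
the assumptions we impose on the model. We note that 
similar sufficient conditions for the deterministic $X^{(k)}$'s 
case were imposed in papers analyzing Lasso [21] and 
block-regularization methods. 

A0 Column Normalization: $\|X_j^{(k)}\|_2 \le \sqrt{2n}$ for all 
$j = 1, \ldots, p$ and $k = 1, \ldots, r$. 

A1 Incoherence Condition: 

$$ \gamma_b := 1 - \max_{j \in U^c} \sum_{k=1}^{r} \left\| \langle X_j^{(k)}, X_{U_k}^{(k)} \rangle \left( \langle X_{U_k}^{(k)}, X_{U_k}^{(k)} \rangle \right)^{-1} \right\|_1 > 0, $$

where $U_k$ denotes the support of the $k$-th column of $\bar\Theta$, and 
$U = \bigcup_k U_k$ denotes the union of the supports of all tasks. We 
will also find it useful to define 

$$ \gamma_s := 1 - \max_{1 \le k \le r} \max_{j \in U_k^c} \left\| \langle X_j^{(k)}, X_{U_k}^{(k)} \rangle \left( \langle X_{U_k}^{(k)}, X_{U_k}^{(k)} \rangle \right)^{-1} \right\|_1. $$

Note that by the incoherence condition A1, we have $\gamma_s > 0$. 

A2 Minimum Curvature Condition: 

$$ C_{\min} := \min_{1 \le k \le r} \lambda_{\min}\left( \frac{1}{n} \langle X_{U_k}^{(k)}, X_{U_k}^{(k)} \rangle \right) > 0. $$

Also, define $D_{\max} := \max_{1 \le k \le r} \left\| \left( \frac{1}{n} \langle X_{U_k}^{(k)}, X_{U_k}^{(k)} \rangle \right)^{-1} \right\|_{\infty,1}$. As a 
consequence of A2, we have that $D_{\max}$ is finite. 

A3 Regularizers: We require the regularization parameters 
to satisfy 

A3-1 $\lambda_s > \dfrac{2 (2 - \gamma_s) \sigma \sqrt{\log(pr)}}{\gamma_s \sqrt{n}}$. 

A3-2 $\lambda_b > \dfrac{2 (2 - \gamma_b) \sigma \sqrt{\log(pr)}}{\gamma_b \sqrt{n}}$. 

A3-3 $1 \le \lambda_b/\lambda_s \le r$ and $\lambda_b/\lambda_s$ is not an integer (see the 
supporting lemmas for the reason). 

Theorem 1. Suppose A0-A3 hold, and that we obtain estimate 
$\hat\Theta$ from our algorithm. Then, with probability at least $1 - 
c_1 \exp(-c_2 n)$, we are guaranteed that the convex program 
(1) has a unique optimum and 

(a) The estimate $\hat\Theta$ has no false inclusions, and has bounded 
$\ell_\infty$ norm error, so that 

$$ \mathrm{Supp}(\hat\Theta) \subseteq \mathrm{Supp}(\bar\Theta), \quad \text{and} \quad \|\hat\Theta - \bar\Theta\|_{\infty,\infty} \le \underbrace{\sqrt{\frac{4\sigma^2 \log(pr)}{n C_{\min}}} + \lambda_s D_{\max}}_{b_{\min}}. \qquad (2) $$

(b) The estimate $\hat\Theta$ has no false exclusions, i.e., 
$\mathrm{sign}(\mathrm{Supp}(\hat\Theta)) = \mathrm{sign}(\mathrm{Supp}(\bar\Theta))$, provided that 
$\min_{(j,k) \in \mathrm{Supp}(\bar\Theta)} |\bar\theta_j^{(k)}| > b_{\min}$ for $b_{\min}$ defined in part (a). 

The positive constants $c_1, c_2$ depend only on $\gamma_s, \gamma_b, \lambda_s, \lambda_b$ 
and $\sigma$, but are otherwise independent of $n, p, r$, the problem 
dimensions of interest. 

Remark: Condition (a) guarantees that the estimate will 
have no false inclusions; i.e. all included features will be 
relevant. If, in addition, we require that it have no false 
exclusions and that it recover the support exactly, we need to 
impose the assumption in (b) that the non-zero elements are 
large enough to be detectable above the noise. 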



B. General Gaussian Designs 

Often the design matrices consist of samples from a 
Gaussian ensemble (e.g. in Gaussian graphical model 
structure learning). Suppose that for each task $k = 1, \ldots, r$ 
the design matrix $X^{(k)} \in \mathbb{R}^{n \times p}$ is such that each row 
$X_i^{(k)} \in \mathbb{R}^p$ is a zero-mean Gaussian random vector with 
covariance matrix $\Sigma^{(k)} \in \mathbb{R}^{p \times p}$, and is independent of every 
other row. Let $\Sigma_{V,U}^{(k)} \in \mathbb{R}^{|V| \times |U|}$ be the submatrix of $\Sigma^{(k)}$ 
with rows corresponding to $V$ and columns to $U$. We require 
these covariance matrices to satisfy the following conditions: 

C1 Incoherence Condition: 

$$ \gamma_b := 1 - \max_{j \in U^c} \sum_{k=1}^{r} \left\| \Sigma_{j,U_k}^{(k)} \left( \Sigma_{U_k,U_k}^{(k)} \right)^{-1} \right\|_1 > 0. $$
[Figure 1: three panels, (a) $\alpha = 0.3$, (b) $\alpha = 2/3$, (c) $\alpha = 0.8$; horizontal axis: control parameter $\theta$.] 

Fig. 1. Probability of success in recovering the true signed support using the dirty model, Lasso and the $\ell_1/\ell_\infty$ regularizer. For a 2-task problem, 
the probability of success for different values of the feature-overlap fraction $\alpha$ is plotted. As we can see, in the regimes where Lasso is better than, 
as good as, and worse than the $\ell_1/\ell_\infty$ regularizer ((a), (b) and (c), respectively), the dirty model outperforms both of the methods, i.e., it requires 
fewer observations for successful recovery of the true signed support compared to Lasso and the $\ell_1/\ell_\infty$ regularizer. Here $s = \lfloor p/10 \rfloor$ 
always. 



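C2 Minimum Curvature Condition: 

$$ C_{\min} := \min_{1 \le k \le r} \lambda_{\min}\left( \Sigma_{U_k,U_k}^{(k)} \right) > 0, $$

and let $D_{\max} := \max_{1 \le k \le r} \left\| \left( \Sigma_{U_k,U_k}^{(k)} \right)^{-1} \right\|_{\infty,1}$. 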



These conditions are analogues of the conditions for 
deterministic designs; they are now imposed on the covariance 
matrix of the (randomly generated) rows of the design matrix. 

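C3 Regularizers: Defining $s := \max_k |U_k|$, we require the 
regularization parameters to satisfy 

C3-1 $\lambda_s > \dfrac{\left( 4\sigma^2 C_{\min} \log(pr) \right)^{1/2}}{\gamma_s \sqrt{n C_{\min}} - \sqrt{2 s \log(pr)}}$. 

C3-2 $\lambda_b > \dfrac{\left( 4\sigma^2 C_{\min}\, r (r \log(2) + \log(p)) \right)^{1/2}}{\gamma_b \sqrt{n C_{\min}} - \sqrt{2 s r (r \log(2) + \log(p))}}$. 

C3-3 $1 \le \lambda_b/\lambda_s \le r$ and $\lambda_b/\lambda_s$ is not an integer. 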



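Theorem 2. Suppose assumptions C1-C3 hold, and that the 
number of samples scales as 

$$ n > \max\left( \frac{2 s \log(pr)}{C_{\min} \gamma_s^2}, \; \frac{2 s r (r \log(2) + \log(p))}{C_{\min} \gamma_b^2} \right). $$

Suppose we obtain estimate $\hat\Theta$ from our algorithm. Then, with 
probability at least 

$$ 1 - c_1 \exp\left( -c_2 (r \log(2) + \log(p)) \right) - c_3 \exp\left( -c_4 \log(rs) \right) \to 1 $$

for some positive numbers $c_1$-$c_4$, we are guaranteed that 
the algorithm estimate $\hat\Theta$ is unique and satisfies the following 
conditions: 

(a) The estimate $\hat\Theta$ has no false inclusions, and has bounded 
$\ell_\infty$ norm error, so that 

$$ \mathrm{Supp}(\hat\Theta) \subseteq \mathrm{Supp}(\bar\Theta), \quad \text{and} \quad \|\hat\Theta - \bar\Theta\|_{\infty,\infty} \le \underbrace{\sqrt{\frac{50\sigma^2 \log(rs)}{n C_{\min}}} + \lambda_s \left( \frac{4s}{C_{\min} \sqrt{n}} + D_{\max} \right)}_{g_{\min}}. \qquad (3) $$

(b) The estimate $\hat\Theta$ has no false exclusions, i.e., 
$\mathrm{sign}(\mathrm{Supp}(\hat\Theta)) = \mathrm{sign}(\mathrm{Supp}(\bar\Theta))$, provided that 
$\min_{(j,k) \in \mathrm{Supp}(\bar\Theta)} |\bar\theta_j^{(k)}| > g_{\min}$ for $g_{\min}$ defined in part (a). 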
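
To get a feel for the scaling in Theorem 2, the sketch below evaluates the sample-size requirement and the lower bounds of C3-1 and C3-2 for given problem constants. It assumes the sample-size bound has denominators $C_{\min}\gamma_s^2$ and $C_{\min}\gamma_b^2$, which are exactly the values that keep the denominators of C3-1 and C3-2 positive. The function names and the numeric values of $C_{\min}$, $\gamma_s$, $\gamma_b$ used in the example are assumptions for illustration; in practice they depend on the covariance matrices.

```python
import math

def min_samples(s, p, r, c_min, gamma_s, gamma_b):
    """Smallest n (as a real number) allowed by the sample-size condition."""
    a = 2 * s * math.log(p * r) / (c_min * gamma_s ** 2)
    b = 2 * s * r * (r * math.log(2) + math.log(p)) / (c_min * gamma_b ** 2)
    return max(a, b)

def lambda_lower_bounds(n, s, p, r, sigma, c_min, gamma_s, gamma_b):
    """Lower bounds on (lambda_s, lambda_b) from conditions C3-1 and C3-2."""
    block = r * (r * math.log(2) + math.log(p))
    den_s = gamma_s * math.sqrt(n * c_min) - math.sqrt(2 * s * math.log(p * r))
    den_b = gamma_b * math.sqrt(n * c_min) - math.sqrt(2 * s * block)
    if den_s <= 0 or den_b <= 0:
        raise ValueError("n is below the sample size required by the theorem")
    lam_s = math.sqrt(4 * sigma ** 2 * c_min * math.log(p * r)) / den_s
    lam_b = math.sqrt(4 * sigma ** 2 * c_min * block) / den_b
    return lam_s, lam_b

# Hypothetical constants, chosen only to exercise the formulas.
n_req = min_samples(s=10, p=128, r=2, c_min=1.0, gamma_s=0.5, gamma_b=0.5)
print(round(n_req, 1), lambda_lower_bounds(1200, 10, 128, 2, 0.1, 1.0, 0.5, 0.5))
```

Both lower bounds blow up as $n$ approaches the required sample size from above, which is one way to read the sample-size condition: it is the point at which admissible regularization parameters first exist.
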



C. Quantifying the gain for 2-Task Gaussian Designs 

This is one of the most important results of this paper. Here, 
we perform a more delicate and finer analysis to establish 
precise quantitative gains of our method. We focus on the 
special case where $r = 2$ and the design matrix has rows 
generated from the standard Gaussian distribution $\mathcal{N}(0, I_{n \times n})$. 
As we will see both analytically and experimentally, our 
method strictly outperforms both Lasso and $\ell_1/\ell_\infty$ block-regularization 
in all cases, except at the extreme endpoints 
of no support sharing (where it matches Lasso) 
and full support sharing (where it matches $\ell_1/\ell_\infty$). We 
now present our analytical results; the empirical comparisons 
are presented next in Section IV. The results will be in terms 
of a particular rescaling of the sample size $n$ as 

$$ \theta(n, p, s, \alpha) := \frac{n}{(2 - \alpha)\, s \log(p - (2 - \alpha)s)}. $$

We also require that the regularizers satisfy 

F1 $\lambda_s > \dfrac{\left( 4\sigma^2 \left( 1 - \sqrt{s/n} \right) \left( \log(r) + \log(p - (2 - \alpha)s) \right) \right)^{1/2}}{\sqrt{n} - \sqrt{s} - \left( (2 - \alpha)\, s \left( \log(r) + \log(p - (2 - \alpha)s) \right) \right)^{1/2}}$. 

F2 $\lambda_b > \dfrac{\left( 4\sigma^2 \left( 1 - \sqrt{s/n} \right) r \left( r \log(2) + \log(p - (2 - \alpha)s) \right) \right)^{1/2}}{\sqrt{n} - \sqrt{s} - \left( (1 - \alpha/2)\, s r \left( r \log(2) + \log(p - (2 - \alpha)s) \right) \right)^{1/2}}$. 

F3 $\lambda_b/\lambda_s = \sqrt{2}$. 



Theorem 3. Consider a 2-task regression problem $(n, p, s, \alpha)$, 
where the design matrix has rows generated from the standard 
Gaussian distribution $\mathcal{N}(0, I_{n \times n})$. Suppose 

$$ \max_{j \in B^*} \left| \, |\Theta_j^{*(1)}| - |\Theta_j^{*(2)}| \, \right| \le c \lambda_s, $$

where $B^*$ is the submatrix of $\Theta^*$ with rows where both entries 
are non-zero and $c$ is a constant specified in the accompanying lemma. Then 
the estimate $\hat\Theta$ of the problem (1) satisfies the following: 

(Success) Suppose the regularization coefficients satisfy F1-F3. 
Further, assume that the number of samples scales as 
$\theta(n, p, s, \alpha) > 1$. Then, with probability at least 
$1 - c_1 \exp(-c_2 n)$ for some positive numbers $c_1$ and $c_2$, we 
are guaranteed that $\hat\Theta$ satisfies the support-recovery and 
$\ell_\infty$ error bound conditions (a-b) in Theorem 2. 

(Failure) If $\theta(n, p, s, \alpha) < 1$, there is no solution $(\hat B, \hat S)$ for 
any choices of $\lambda_s$ and $\lambda_b$ such that 
$\mathrm{sign}(\mathrm{Supp}(\hat\Theta)) = \mathrm{sign}(\mathrm{Supp}(\Theta^*))$. 

Remark: The assumption on the gap $\left| \, |\Theta_j^{*(1)}| - |\Theta_j^{*(2)}| \, \right|$ 
reflects the fact that we require most values of $\Theta^*$ to be 
balanced on both tasks on the shared support. As we show in 
a more general theorem (Theorem 4) in Section VI-C, even in 
the case where the gap is large, the dependence of the sample 
scaling on the gap is quite weak. 

IV. Simulation Results 

In this section, we provide some simulation results. First, 
using our synthetic data set, we investigate the consequences 
of Theorem 3 when we have $r = 2$ tasks to learn. As we see, 
the empirical results verify our theoretical guarantees. Next, 
we apply our method to a real dataset: a handwritten 
digit classification dataset with $r = 10$ tasks (equal to 
the number of digits, 0-9). For this dataset, we show that our 
method outperforms both LASSO and $\ell_1/\ell_\infty$ in practice. For 
each method, the parameters are chosen via cross-validation; 
see supplemental material for more details. 

A. Synthetic Data Simulation 

Consider an $r = 2$-task regression problem of the form 
$(n, p, s, \alpha)$ as discussed in Theorem 3. For a fixed set of 
parameters $(n, s, p, \alpha)$, we generate 100 instances of the 
problem. Then, we solve the same problem using our model, 
the $\ell_1/\ell_\infty$ regularizer and LASSO, searching for the penalty 
regularizer coefficients independently for each one of these 
programs to find the best regularizer by cross validation. After 
solving the three problems, we compare the signed support of 
the solution with the true signed support and decide whether 
or not the program was successful in signed support recovery. 
We describe this process in more detail in this section. 

Data Generation: We explain how we generated the data 
for our simulation here. We pick three different values of 
$p = 128, 256, 512$ and let $s = \lfloor 0.1p \rfloor$. For different values 
of $\alpha$, we let $n = c\, s \log(p - (2 - \alpha)s)$ for different values 
of $c$. We generate a random sign matrix $\Theta^* \in \mathbb{R}^{p \times 2}$ (each 
entry is either 0, 1 or -1) with column support size $s$ 
and row support size $(2 - \alpha)s$ as required by Theorem 
3. Then, we multiply each row by a real random number 
with magnitude greater than the minimum required for sign 
support recovery by Theorem 3. We generate two sets of 
matrices $X^{(k)}$ and $W$ and use one of them for training 
and the other one for cross validation (test), subscripted 
Tr and Ts, respectively. Each entry of the noise matrices 
$W_{Tr}, W_{Ts} \in \mathbb{R}^{n \times 2}$ is drawn independently according to 
$\mathcal{N}(0, \sigma^2)$ where $\sigma = 0.1$. Each row of a design matrix 
$X_{Tr}^{(k)}, X_{Ts}^{(k)} \in \mathbb{R}^{n \times p}$ is sampled, independently of any other 
row, from the standard Gaussian distribution for all $k = 1, 2$. Having $X^{(k)}$, $\Theta^*$ 
and $W$ in hand, we can calculate $Y_{Tr}, Y_{Ts} \in \mathbb{R}^{n \times 2}$ using the 
model $y^{(k)} = X^{(k)} \theta^{*(k)} + w^{(k)}$ for all $k = 1, 2$ for both the train 
and test sets of variables. 
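
The data generation procedure can be sketched as follows. The function name `make_problem`, the rounding of $\alpha s$ to an integer number of shared rows, and the parameter `min_mag` (a stand-in for the minimum magnitude required by Theorem 3, which is not computed here) are all assumptions made for illustration.

```python
import numpy as np

def make_problem(p, alpha, c, sigma=0.1, min_mag=1.0, seed=0):
    """Draw a 2-task (n, p, s, alpha) instance with train and test data."""
    rng = np.random.default_rng(seed)
    s = int(np.floor(0.1 * p))
    shared = int(round(alpha * s))            # rows used by both tasks
    n = int(np.ceil(c * s * np.log(p - (2 - alpha) * s)))

    # Random sign matrix: column support s, row support 2s - shared.
    theta = np.zeros((p, 2))
    rows = rng.permutation(p)[: 2 * s - shared]
    both, only1, only2 = rows[:shared], rows[shared:s], rows[s:]
    theta[both] = rng.choice([-1.0, 1.0], size=(shared, 2))
    theta[only1, 0] = rng.choice([-1.0, 1.0], size=only1.size)
    theta[only2, 1] = rng.choice([-1.0, 1.0], size=only2.size)
    # Scale every row by a random magnitude of at least min_mag.
    theta *= (min_mag + rng.random(p))[:, None]

    def draw():
        Xs = [rng.standard_normal((n, p)) for _ in range(2)]
        W = sigma * rng.standard_normal((n, 2))
        Y = np.column_stack([Xs[k] @ theta[:, k] for k in range(2)]) + W
        return Xs, Y

    return theta, draw(), draw()              # truth, train set, test set

theta, (X_tr, Y_tr), (X_ts, Y_ts) = make_problem(p=128, alpha=0.5, c=2.0)
print(theta.shape, Y_tr.shape, int((np.abs(theta).max(axis=1) > 0).sum()))
```

With $p = 128$ and $\alpha = 0.5$ this gives $s = 12$ features per task, 6 of them shared, and a row support of 18, matching the $(2 - \alpha)s$ row support size in the text.
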

Coordinate Descent Algorithm: Given the generated 
data $X_{Tr}^{(k)}$ for $k = 1, 2$ and $Y_{Tr}$ from the previous section, 
we want to recover matrices $\hat B$ and $\hat S$ that satisfy (1). We 
use the coordinate descent algorithm to numerically solve 
the problem (see the appendix). The algorithm takes as input the 
tuple $(X_{Tr}^{(1)}, X_{Tr}^{(2)}, Y_{Tr}, \lambda_s, \lambda_b, \epsilon, \underline{B}, \underline{S})$ and outputs a matrix 
pair $(\hat B, \hat S)$. The inputs $(\underline{B}, \underline{S})$ are an initial guess and can be 
set to zero. However, when we search for optimal penalty 
regularizer coefficients, we can use the result for the previous 
set of coefficients $(\lambda_b, \lambda_s)$ as a good initial guess for the 
next, slightly incremented, pair of coefficients. The parameter $\epsilon$ captures 
the stopping criterion threshold of the algorithm. We iterate 
inside the algorithm until the relative update change of the 
objective function is less than $\epsilon$. Since we do not run the 
algorithm to completion (which would require $\epsilon = 0$), we need to filter 
the small magnitude values in the solution $(\hat B, \hat S)$ and set 
them to zero. 

Choosing penalty regularizer coefficients: Dictated by 
optimality conditions, we have $1 > \lambda_s/\lambda_b > 1/2$. Thus, the search 
range for one of the coefficients is bounded and known. We 
set $\lambda_b = c \sqrt{r \log(p) / n}$ and search for $c \in [0.01, 100]$, where 
this interval is partitioned logarithmically. For any pair $(\lambda_b, \lambda_s)$ 
we compute the objective function of $Y_{Ts}$ and $X_{Ts}^{(k)}$ for 
$k = 1, 2$ using the filtered $(\hat B, \hat S)$ from the coordinate descent 
algorithm. Then across all choices of $(\lambda_b, \lambda_s)$, we pick the 
one with minimum objective function on the test data. Finally 
we let $\hat\Theta = \mathrm{Filter}(\hat B + \hat S)$ for $(\hat B, \hat S)$ corresponding to the 
optimal $(\lambda_b, \lambda_s)$. 
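
The selection procedure can be sketched as a grid search with warm starts and a filtering step. The sketch is solver-agnostic: `solve` is a hypothetical callback with signature `solve(Xs, Y, lam_s, lam_b, S0, B0) -> (S, B)` that the reader must supply. Two further assumptions are made here: the held-out criterion is taken to be the squared-error term only, and the filter threshold `tol` is an arbitrary small constant.

```python
import numpy as np

def filter_small(M, tol=1e-3):
    """Set entries with magnitude below tol to exactly zero."""
    out = M.copy()
    out[np.abs(out) < tol] = 0.0
    return out

def heldout_loss(Xs, Y, Theta):
    """Squared-error term of (1) evaluated on held-out data."""
    n = Y.shape[0]
    return sum(np.sum((Y[:, k] - Xs[k] @ Theta[:, k]) ** 2)
               for k in range(len(Xs))) / (2 * n)

def select_by_validation(solve, train, test, lam_b_grid, ratios, tol=1e-3):
    """Pick (lam_s, lam_b) minimizing the held-out loss of the filtered estimate."""
    (X_tr, Y_tr), (X_ts, Y_ts) = train, test
    p, r = X_tr[0].shape[1], Y_tr.shape[1]
    S, B = np.zeros((p, r)), np.zeros((p, r))   # initial guess: zero
    best_loss, best_lams, best_theta = np.inf, None, None
    for lam_b in lam_b_grid:
        for ratio in ratios:                    # ratio = lam_s / lam_b
            lam_s = ratio * lam_b
            # Warm start from the solution of the previous pair.
            S, B = solve(X_tr, Y_tr, lam_s, lam_b, S, B)
            theta = filter_small(S + B, tol)
            loss = heldout_loss(X_ts, Y_ts, theta)
            if loss < best_loss:
                best_loss, best_lams, best_theta = loss, (lam_s, lam_b), theta
    return best_loss, best_lams, best_theta
```

A logarithmic partition of the interval for $c$ can be produced with `np.logspace(-2, 2, num)`, multiplied by the chosen scale for $\lambda_b$, and the ratios can be taken on a grid strictly between 1/2 and 1 for the 2-task problem.
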
[Figure 2: phase transition threshold versus the shared support parameter $\alpha$.] 

Fig. 2. Verification of the result of Theorem 3 on the behavior of the 
phase transition threshold as the parameter $\alpha$ changes in a 2-task 
$(n, p, s, \alpha)$ problem for our method, LASSO and the $\ell_1/\ell_\infty$ regularizer. 
The $y$-axis is $\frac{n}{s \log(p - (2 - \alpha)s)}$, where $n$ is the number of samples at 
which the threshold was observed. Here $s = \lfloor p/10 \rfloor$. Our method shows 
a gain in sample complexity over the entire range of sharing $\alpha$. The 
pre-constant in Theorem 3 is also validated. 



B. Handwritten Digits Dataset 

We use a handwritten digit dataset to illustrate the 
performance of our method. According to the description of 
the dataset, it consists of features of handwritten 
numerals (0-9) extracted from a collection of Dutch utility 
maps. This dataset has been used by a number of papers 
as a reliable dataset for handwritten recognition 
algorithms. 

Structure of the Dataset: In this dataset, there are 200 
instances of each handwritten digit 0-9 (2000 digits in total). 
Each instance of each digit is scanned to an image of 
size 30 × 48 pixels. This image is NOT provided by the 
dataset. Using the full resolution image of each digit, the 
dataset provides six different classes of features. A total of 
649 features are provided for each instance of each digit. 
The information about each class of features is provided in 
Table I. The combined handwriting images of 
record number 100 are shown in Fig. 3 (ten images are concatenated 
together with a spacer between each two). 



Performance Analysis: We ran the algorithm for three 
different values of the overlap ratio $\alpha \in \{0.3, 2/3, 0.8\}$ with 
three different numbers of features $p \in \{128, 256, 512\}$. For 
any instance of the problem $(n, p, s, \alpha)$, if the recovered matrix 
$\hat\Theta$ has the same signed support as the true $\Theta^*$, then we count it as a 
success, otherwise as a failure (even if one element has a different 
sign, we count it as a failure). 

As Theorem 3 predicts and the plots show, the right scaling 
for the number of observations is $\frac{n}{s \log(p - (2 - \alpha)s)}$, where all 
curves stack on top of each other at $2 - \alpha$. Also, the 
number of observations required by our model for true signed 
support recovery is always less than for both LASSO and the $\ell_1/\ell_\infty$ 
regularizer. Fig. 1(a) shows the probability of success for the 
case $\alpha = 0.3$ (when LASSO is better than the $\ell_1/\ell_\infty$ regularizer) 
and that our model outperforms both methods. When $\alpha = 2/3$ 
(see Fig. 1(b)), LASSO and the $\ell_1/\ell_\infty$ regularizer perform the 
same, but our model requires almost 33% fewer observations for 
the same performance. As $\alpha$ grows toward 1, e.g. $\alpha = 0.8$ as 
shown in Fig. 1(c), $\ell_1/\ell_\infty$ performs better than LASSO. Still, 
our model performs better than both methods in this case as 
well. 

Scaling Verification: To verify that the phase transition 
threshold changes linearly with $\alpha$ as predicted by Theorem 
3, we plot the phase transition threshold versus $\alpha$. For five 
different values of $\alpha \in \{0.05, 0.3, 2/3, 0.8, 0.95\}$ and three 
different values of $p \in \{128, 256, 512\}$, we find the phase 
transition threshold for our model, LASSO and the $\ell_1/\ell_\infty$ 
regularizer. We consider the point where the probability of 
success in recovery of signed support exceeds 50% as the 
phase transition threshold. We find this point by interpolation 
on the closest two points. Fig. 2 shows that the phase transition 
threshold for our model is always lower than the phase 
transition for LASSO and the $\ell_1/\ell_\infty$ regularizer. 
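
The threshold extraction described above amounts to a linear interpolation between the two grid points that bracket the 50% level. A minimal sketch follows; the function name and the convention of returning `None` when the success curve never exceeds the level are our own choices.

```python
import numpy as np

def transition_threshold(theta_grid, success_prob, level=0.5):
    """Rescaled sample size at which the success probability crosses `level`.

    theta_grid must be increasing; success_prob holds the empirical success
    frequency measured at each grid point.
    """
    theta_grid = np.asarray(theta_grid, dtype=float)
    success_prob = np.asarray(success_prob, dtype=float)
    above = np.flatnonzero(success_prob > level)
    if above.size == 0:
        return None                      # never exceeds the level
    i = int(above[0])
    if i == 0:
        return float(theta_grid[0])      # already above at the first point
    x0, x1 = theta_grid[i - 1], theta_grid[i]
    y0, y1 = success_prob[i - 1], success_prob[i]
    # Linear interpolation between the two closest points.
    return float(x0 + (level - y0) * (x1 - x0) / (y1 - y0))

print(transition_threshold([1.0, 1.2, 1.4, 1.6], [0.0, 0.2, 0.8, 1.0]))
```

Repeating this for each value of $\alpha$ and each $p$ gives the points that are plotted against $\alpha$ to check the linear dependence predicted by Theorem 3.
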



Fitting the dataset to our model: Regardless of the nature 
of the features, we have 649 features for each of the 200 instances 
of each digit. We need to learn $K = 10$ different tasks 
corresponding to the ten different digits. To make the associated 
numbers of features comparable, we shrink the dynamic range 
of each feature to the interval between -1 and 1. We divide each feature 
by an appropriate number (perhaps larger than the maximum 
of that feature in the dataset) to make sure that the dynamic 
range of all features is a (not too small) subset of $[-1, 1]$. 
Notice that in this division process, we do not use the 
minimum and maximum of the training set. We just divide 
each feature by a fixed and predetermined number, the 
maximum listed in Table I. For example, we divide the Pixel 
Shape feature by 6, the Karhunen-Loeve coefficients by 17, and the 
last morphological feature by 18000. We do not 
shift the data; we only scale it. 

Out of the 200 samples provided for each digit, we take 
$n < 200$ samples for training. Let $X^{(k)} = X \in \mathbb{R}^{10n \times 649}$ for 
$0 \le k \le 9$ be the matrix whose first $n$ rows correspond to 
the features of the digit 0, the second $n$ rows correspond to 
the features of the digit 1, and so on. Consequently, we set the 
vector $y^{(k)} \in \{0, 1\}^{10n}$ to be the vector such that $y_j^{(k)} = 1$ if 
and only if the $j$-th row of the feature matrix $X$ corresponds 
to the digit $k$. This setup is called the binary classification setup. 

We want to find a block-sparse matrix $\hat B \in \mathbb{R}^{649 \times 10}$ and a 
sparse matrix $\hat S \in \mathbb{R}^{649 \times 10}$, so that for a given feature vector 
$x \in \mathbb{R}^{649}$ extracted from the image of a handwritten digit 
$0 \le k^* \le 9$, we ideally have $k^* = \arg\max_{0 \le k \le 9} x (\hat b^{(k)} + \hat s^{(k)})$. 

To find such matrices $\hat B$ and $\hat S$, we solve (1). We tune 
the parameters $\lambda_b$ and $\lambda_s$ in order to get the best result 
by cross validation. Since we have 10 tasks, we search 
for $\lambda_s/\lambda_b \in [1/10, 1]$ and let $\lambda_b = c \sqrt{r \log(p)/n}$, where, 
empirically, $c \in [0.01, 10]$ is a constant to be searched. 
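
The preprocessing and prediction steps of this setup can be sketched as below. The function names are illustrative, and the arrays in the example are toy values rather than the actual dataset: `max_per_feature` stands for the fixed divisors taken from the dynamic ranges in Table I.

```python
import numpy as np

def scale_features(X, max_per_feature):
    """Divide each feature (column) by a fixed, predetermined maximum; no shift."""
    return X / np.asarray(max_per_feature, dtype=float)

def one_vs_all_labels(digits, n_classes=10):
    """Column k is the 0/1 response vector y^(k) of the k-th task."""
    digits = np.asarray(digits)
    return (digits[:, None] == np.arange(n_classes)[None, :]).astype(int)

def predict_digit(x, B, S):
    """Predicted class: the task whose linear response x (b^(k) + s^(k)) is largest."""
    return int(np.argmax(x @ (B + S)))

# Toy example with 2 features and 3 classes.
X = scale_features(np.array([[6.0, 17.0], [3.0, -17.0]]), [6.0, 17.0])
Y = one_vs_all_labels([0, 0, 1, 1, 2, 2], n_classes=3)
B = np.array([[1.0, 0.0, -1.0], [0.0, 0.0, 0.0]])
S = np.array([[0.0, 0.0, 0.0], [0.0, 2.0, 0.0]])
print(X.tolist(), Y[:, 1].tolist(), predict_digit(np.array([0.2, 1.0]), B, S))
```

Because the same feature matrix is shared by all ten tasks, only the response columns differ between tasks, which is what makes the shared row support of $\hat B$ meaningful here.
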

Performance Analysis: Table shows the results of our 
analysis for different sizes of the training set as We 
     Feature                                  Size   Type      Dynamic Range 
1    Pixel Shape (15 × 16)                    240    Integer   0-6 
2    2D Fourier Transform Coefficients        74     Real      0-1 
3    Karhunen-Loeve Transform Coefficients    64     Real      -17:17 
4    Profile Correlation                      216    Integer   0-1400 
5    Zernike Moments                          46     Real      0-800 
6    Morphological Features                   3      Integer   0-6 
                                              1      Real      100-200 
                                              1      Real      1-3 
                                              1      Real      1500-18000 

TABLE I 

Six different classes of features provided in the dataset. The dynamic ranges are approximate, not exact. The dynamic ranges of the 
different morphological features are completely different; for those 6 morphological features, we provide their 
dynamic ranges separately. 




Fig. 3. An instance of images of the ten digits extracted from the dataset 



measure the classification error on the test set for each digit 
to get the 10-vector of errors. Then, we find the average error 
and the variance of the error vector to show how the error 
is distributed over all tasks. We compare our method with 
the $\ell_1/\ell_\infty$ regularizer method and LASSO. 



V. Proof Outline 

In this section we illustrate the proof outline of all three 
theorems as they are very similar in the nature. First, we 
introduce some notations and definitions and then, we provide 
a three step proof technique that we used to prove all three 
theorems. 

A. Definitions and Setup 

In this section, we rigorously define the terms and notation 
we used throughout the proofs. 

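Notation: For a vector $v$, the norms $\ell_1$, $\ell_2$ and $\ell_\infty$ are
denoted as $\|v\|_1 = \sum_k |v_k|$, $\|v\|_2 = \sqrt{\sum_k |v_k|^2}$
and $\|v\|_\infty = \max_k |v_k|$, respectively. Also, for a
matrix $Q \in \mathbb{R}^{p \times r}$ with rows $q_1, \ldots, q_p$, the $\ell_\rho/\ell_\zeta$ norm is denoted as
$\|Q\|_{\rho,\zeta} = \big\| (\|q_1\|_\zeta, \ldots, \|q_p\|_\zeta) \big\|_\rho$. The maximum singular
value of $Q$ is denoted as $\lambda_{\max}(Q)$. For a matrix $X \in \mathbb{R}^{n \times p}$
and a set of indices $U \subseteq \{1, \ldots, p\}$, the matrix $X_U \in \mathbb{R}^{n \times |U|}$
represents the sub-matrix of $X$ consisting of the columns $X_j$ with
$j \in U$.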
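
As a concrete reading of the mixed-norm notation, the following sketch computes $\|Q\|_{\rho,\zeta}$ by taking the $\ell_\zeta$ norm of each row and then the $\ell_\rho$ norm of the resulting vector. The function names are ours and the matrix is an arbitrary example; matrices are represented as lists of rows.

```python
INF = float("inf")

def vector_norm(v, q):
    """l_q norm of a list of numbers; q is a positive number or INF."""
    mags = [abs(x) for x in v]
    if q == INF:
        return max(mags, default=0.0)
    return sum(m ** q for m in mags) ** (1.0 / q)

def mixed_norm(Q, rho, zeta):
    """||Q||_{rho,zeta}: l_zeta norm of every row, then l_rho norm of those values."""
    return vector_norm([vector_norm(row, zeta) for row in Q], rho)

Q = [[3.0, -4.0], [0.0, 1.0]]
norm_1_inf = mixed_norm(Q, 1, INF)    # block penalty applied to B
norm_1_1 = mixed_norm(Q, 1, 1)        # elementwise l_1 penalty applied to S
norm_inf_1 = mixed_norm(Q, INF, 1)    # largest row l_1 norm, as used in (C4)
```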

1) Towards Identifying Optimal Solution: This is a key step
in our analysis. Our proof proceeds by choosing a pair $(\tilde{B}, \tilde{S})$
such that the signed support of $\tilde{B} + \tilde{S}$ is the same as that of
$\bar{\Theta}$, and then certifying that, under our assumptions, this pair
is the optimum of the optimization problem (1). We construct
this pair via a surrogate optimization problem, dubbed the oracle
problem in the literature as well as in our proof outline below,
which adds extra constraints to (1) in a way that ensures
signed support recovery. Constructing the oracle problem is a key
step in our proof.

For (1), let $d = \lfloor \lambda_b/\lambda_s \rfloor$; in this paper we will always have
$1 \le d \le r$, where we recall $r$ is the number of tasks. Using this
$d$, we now define two matrices $B^*, S^*$, such that $B^* + S^* = \bar{\Theta}$,
as follows. In each row $\bar{\Theta}_j$, let $v_j$ be the $(d+1)$-th largest
magnitude of the elements in $\bar{\Theta}_j$. Then, the $(j,k)$-th element
$s_j^{*(k)}$ of the matrix $S^*$ is defined as

$$s_j^{*(k)} = \mathrm{sign}\big(\bar{\theta}_j^{(k)}\big)\, \max\big\{0,\ |\bar{\theta}_j^{(k)}| - v_j\big\}.$$
  n     Measure                         Our Model               l1/l_inf   LASSO
  5%    Average Classification Error    8.6%                    9.9%       10.8%
        Variance of Error               0.53%                   0.64%      0.51%
        Average Row Support Size        B:165   B+S:171         170        123
        Average Support Size            S:18    B+S:1651        1700       539
  10%   Average Classification Error    3.0%                    3.5%       4.1%
        Variance of Error               0.56%                   0.62%      0.68%
        Average Row Support Size        B:211   B+S:226         217        173
        Average Support Size            S:34    B+S:2118        2165       821
  20%   Average Classification Error    2.2%                    3.2%       2.8%
        Variance of Error               0.57%                   0.68%      0.85%
        Average Row Support Size        B:270   B+S:299         368        354
        Average Support Size            S:67    B+S:2761        3669       2053

TABLE II

Simulation Results for our model, $\ell_1/\ell_\infty$ and LASSO.




In words, to obtain $S^*$ we take the matrix $\bar{\Theta}$ and for each
element we clip its magnitude to be the excess over the
$(d+1)$-th largest magnitude in its row. We retain the sign. Finally,
define $B^* = \bar{\Theta} - S^*$ to be the residual. It is thus clear that

• $S^*$ will have at most $d$ non-zero elements in each row.

• Each row of $B^*$ is either identically 0, or has at least $d$
non-zero elements. Also, in the latter case, at least $d$ of
them have the same magnitude.

• If any element $(j, k)$ is non-zero in both $S^*$ and $B^*$ then
its sign is the same in both.

$S^*$ thus takes on the role of the "true sparse matrix", and $B^*$
the role of the "true block-sparse matrix". We will use $B^*, S^*$
to construct our oracle problem later. The pair also has the
following significance: our results will imply that if we have
infinite samples, then $B^*, S^*$ will be the solution to (1).
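
The clipping construction can be sketched as follows. This is a minimal illustration under our own conventions (matrices as lists of rows, hypothetical function name `decompose`, an arbitrary example matrix); it implements the verbal description above, in which each entry of $S^*$ keeps the sign of the original entry and the excess of its magnitude over the $(d+1)$-th largest magnitude in its row.

```python
import math

def decompose(theta, d):
    """Split theta into (B, S) with B + S = theta by row-wise clipping.

    For each row, v is the (d+1)-th largest magnitude; S keeps the excess over v
    (with the original sign) and B keeps the remainder.
    """
    B, S = [], []
    for row in theta:
        mags = sorted((abs(x) for x in row), reverse=True)
        v = mags[d] if d < len(mags) else 0.0  # d == r means nothing is clipped
        s_row = [math.copysign(max(abs(x) - v, 0.0), x) for x in row]
        b_row = [x - s for x, s in zip(row, s_row)]
        S.append(s_row)
        B.append(b_row)
    return B, S

theta = [[5.0, -3.0, 1.0, 0.5],
         [0.0, 2.0, 0.0, 0.0]]
B_star, S_star = decompose(theta, d=1)
```

In this example the first row is split into a block part with two entries of equal magnitude 3 and a sparse part holding the excess 2, while the second row, which has a single non-zero, goes entirely to the sparse part.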
2) Sparse Matrix Setup: For any matrix $S$, define
$\mathrm{Supp}(S) = \{(j,k) : s_j^{(k)} \neq 0\}$, and let
$U_s = \{S \in \mathbb{R}^{p \times r} : \mathrm{Supp}(S) \subseteq \mathrm{Supp}(S^*)\}$ be the subspace of matrices
whose support is a subset of the support of the matrix $S^*$. The orthogonal
projection onto the subspace $U_s$ can be defined as follows:

$$\big(P_{U_s}(S)\big)_{j,k} = \begin{cases} s_j^{(k)} & (j,k) \in \mathrm{Supp}(S^*) \\ 0 & \text{ow.} \end{cases}$$

We can define the orthogonal complement space of $U_s$
to be $U_s^c = \{S \in \mathbb{R}^{p \times r} : \mathrm{Supp}(S) \cap \mathrm{Supp}(S^*) = \emptyset\}$.
The orthogonal projection onto this space can be defined
as $P_{U_s^c}(S) = S - P_{U_s}(S)$. Since the type of
block-sparsity we consider is a block-sparsity assumption on
the rows of matrices, we need to characterize the sparsity
of the rows of the matrix $S^*$. This motivates us to define
$D(S) = \max_{1 \le j \le p} \|s_j\|_0$, denoting the maximum number of
non-zero elements in any row of the sparse matrix $S$.
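
The support, the projection onto $U_s$, and the quantity $D(S)$ can be illustrated directly. The sketch below uses hypothetical helper names and arbitrary example matrices, with matrices stored as lists of rows and index pairs $(j,k)$ counted from zero.

```python
def supp(S):
    """Set of index pairs (j, k) where S is non-zero."""
    return {(j, k) for j, row in enumerate(S) for k, x in enumerate(row) if x != 0}

def project_onto_support(S, support):
    """P_{U_s}(S): keep the entries indexed by `support`, zero out the rest."""
    return [[x if (j, k) in support else 0.0 for k, x in enumerate(row)]
            for j, row in enumerate(S)]

def max_row_sparsity(S):
    """D(S): the largest number of non-zero entries in any row."""
    return max(sum(1 for x in row if x != 0) for row in S)

S_star = [[2.0, 0.0, 0.0], [0.0, 0.0, -1.0]]
S = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]

on_support = project_onto_support(S, supp(S_star))
off_support = [[x - y for x, y in zip(r, pr)] for r, pr in zip(S, on_support)]  # P_{U_s^c}(S)
```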

3) Row-Sparse Matrix Setup: For any matrix $B$, define
$\mathrm{RowSupp}(B) = \{j : \exists k \text{ s.t. } b_j^{(k)} \neq 0\}$, and let
$U_b = \{B \in \mathbb{R}^{p \times r} : \mathrm{RowSupp}(B) \subseteq \mathrm{RowSupp}(B^*)\}$ be the subspace of
matrices whose row support is a subset of the row
support of the matrix $B^*$. The orthogonal projection onto the
subspace $U_b$ can be defined as follows:

$$\big(P_{U_b}(B)\big)_j = \begin{cases} b_j & j \in \mathrm{RowSupp}(B^*) \\ 0 & \text{ow.} \end{cases}$$

We can define the orthogonal complement space of $U_b$ to be
$U_b^c = \{B \in \mathbb{R}^{p \times r} : \mathrm{RowSupp}(B) \cap \mathrm{RowSupp}(B^*) = \emptyset\}$.
The orthogonal projection onto this space can be defined as
$P_{U_b^c}(B) = B - P_{U_b}(B)$.

For a given matrix $B \in \mathbb{R}^{p \times r}$, let
$M_j(B) = \{k : |b_j^{(k)}| = \|b_j\|_\infty > 0\}$ be the set of indices whose
corresponding elements achieve the maximum magnitude
on the $j$-th row with positive or negative signs. Also, let
$M(B) = \min_{1 \le j \le p} |M_j(B)|$ be the minimum number of
elements that achieve the maximum in each row of the
matrix $B$.

The following technical lemma is useful in the proof of all
three theorems.

Lemma 1. If $(B, S) = H_d(\Theta)$ then
(P1) $M(B) \ge d+1$ and $D(S) \le d$.

(P2) $\mathrm{sign}(s_j^{(k)}) = \mathrm{sign}(b_j^{(k)})$ for all $j \in \mathrm{RowSupp}(B)$ and
$k \in M_j(B)$.

(P3) $s_j^{(k)} = 0$ for all $j \in \mathrm{RowSupp}(B)$ and $k \notin M_j(B)$.

Proof: The proof follows from the definition of $H$. ■
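
To make the row-support quantities concrete, the sketch below computes $\mathrm{RowSupp}(B)$ and $M_j(B)$ and checks properties (P1) and (P3) of Lemma 1 on a small hand-built pair. The helper names and the example pair are ours; the pair is what row-wise clipping with $d = 1$ produces for the rows $(5, -3, 1, 0.5)$ and $(0, 2, 0, 0)$. As an assumption of this sketch, the minimum defining $M(B)$ is taken over the rows in $\mathrm{RowSupp}(B)$ only, since $M_j(B)$ is empty for an all-zero row.

```python
def row_supp(B):
    """Indices of rows of B that contain at least one non-zero entry."""
    return {j for j, row in enumerate(B) if any(x != 0 for x in row)}

def max_set(row):
    """M_j: column indices attaining the (strictly positive) maximum magnitude."""
    top = max(abs(x) for x in row)
    return {k for k, x in enumerate(row) if top > 0 and abs(x) == top}

def min_max_set_size(B):
    """M(B), restricted here to non-zero rows (an assumption of this sketch)."""
    return min(len(max_set(B[j])) for j in row_supp(B))

d = 1
B = [[3.0, -3.0, 1.0, 0.5], [0.0, 0.0, 0.0, 0.0]]
S = [[2.0, 0.0, 0.0, 0.0], [0.0, 2.0, 0.0, 0.0]]

p1_holds = (min_max_set_size(B) >= d + 1
            and max(sum(1 for x in row if x != 0) for row in S) <= d)
p3_holds = all(S[j][k] == 0
               for j in row_supp(B)
               for k in range(len(B[j])) if k not in max_set(B[j]))
```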



B. Proof Overview 

The proofs of all three of our theorems follow a primal-dual 
witness technique, and consist of two steps, as detailed in 
this section. The first step constructs a primal-dual witness 
candidate, and is common to all three theorems. The second 
step consists of showing that the candidate constructed in the 
first step is indeed a primal-dual witness. The theorem proofs 
differ in this second step, and show that under the respective 
conditions imposed in the theorems, the construction succeeds 
with high probability. These steps are as follows: 

STEP 1: Denote the true optimal solution pair $(B^*, S^*) = H_d(\bar{\Theta})$
as defined in Section V-A for $d = \lfloor \lambda_b/\lambda_s \rfloor$. See
Lemma 1 for basic properties of these matrices $B^*$ and $S^*$.

Primal Candidate: We can then design a candidate optimal
solution $(\tilde{S}, \tilde{B})$ with the desired sparsity pattern using a
restricted support optimization problem, called the oracle problem:
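$$(\tilde{S}, \tilde{B}) \in \arg\min_{S \in U_s,\ B \in U_b}\ \frac{1}{2n} \sum_{k=1}^{r} \big\| y^{(k)} - X^{(k)} \big( b^{(k)} + s^{(k)} \big) \big\|_2^2 + \lambda_s \|S\|_{1,1} + \lambda_b \|B\|_{1,\infty}. \qquad (4)$$

Dual Candidate: We set $\tilde{Z}$ as the subgradient of
the optimal primal parameters of (4). Specifically, for all $j \in \bigcup_{k=1}^{r} U_k$, we set

$$\tilde{z}_j^{(k)} = \begin{cases} \lambda_s\, \mathrm{sign}\big(\tilde{s}_j^{(k)}\big) & (j,k) \in \mathrm{Supp}(\tilde{S}) \\[4pt] \dfrac{\lambda_b - \lambda_s \|\tilde{s}_j\|_0}{|M_j(\tilde{B})| - \|\tilde{s}_j\|_0}\, \mathrm{sign}\big(\tilde{b}_j^{(k)}\big) & k \in M_j(\tilde{B})\ \&\ (j,k) \notin \mathrm{Supp}(\tilde{S}) \\[4pt] 0 & \text{ow.} \end{cases}$$

To get an explicit form for $\tilde{z}_j^{(k)}$ with $j \in \bigcap_{k=1}^{r} U_k^c$, let $\Delta = \tilde{B} + \tilde{S} - B^* - S^*$.
From the optimality conditions for the oracle problem (4), we
have

$$\frac{1}{n} \big\langle X_{U_k}^{(k)}, X_{U_k}^{(k)} \big\rangle \Delta_{U_k}^{(k)} - \frac{1}{n} \big( X_{U_k}^{(k)} \big)^T w^{(k)} + \tilde{z}_{U_k}^{(k)} = 0,$$

and consequently,

$$\Delta_{U_k}^{(k)} = \Big( \frac{1}{n} \big\langle X_{U_k}^{(k)}, X_{U_k}^{(k)} \big\rangle \Big)^{-1} \Big( \frac{1}{n} \big( X_{U_k}^{(k)} \big)^T w^{(k)} - \tilde{z}_{U_k}^{(k)} \Big). \qquad (5)$$

Solving for $\tilde{z}_j^{(k)}$ for all $j \in \bigcap_{k=1}^{r} U_k^c$, we get

$$\tilde{z}_j^{(k)} = -\frac{1}{n} \big( X_j^{(k)} \big)^T X_{U_k}^{(k)} \Delta_{U_k}^{(k)} + \frac{1}{n} \big( X_j^{(k)} \big)^T w^{(k)}.$$

Substituting for the value of $\Delta_{U_k}^{(k)}$, we get

$$\tilde{z}_j^{(k)} = \frac{1}{n} \big( X_j^{(k)} \big)^T X_{U_k}^{(k)} \Big( \frac{1}{n} \big\langle X_{U_k}^{(k)}, X_{U_k}^{(k)} \big\rangle \Big)^{-1} \tilde{z}_{U_k}^{(k)} + \frac{1}{n} \big( X_j^{(k)} \big)^T \Big( I - \frac{1}{n} X_{U_k}^{(k)} \Big( \frac{1}{n} \big\langle X_{U_k}^{(k)}, X_{U_k}^{(k)} \big\rangle \Big)^{-1} \big( X_{U_k}^{(k)} \big)^T \Big) w^{(k)}. \qquad (6)$$

STEP 2: This step consists of showing that the triple
$(\tilde{S}, \tilde{B}, \tilde{Z})$ constructed in the earlier step is actually a feasible
primal-dual pair of (1). This would then give the required
support-recovery result, since the constructed primal candidate $(\tilde{S}, \tilde{B})$
had the required sparsity pattern by construction.

We will make use of the following lemma that specifies
a set of sufficient (stationary) optimality conditions for the
$(\tilde{S}, \tilde{B})$ from (4) to be the unique solution of the (unrestricted)
optimization problem (1):

Lemma 2. Under our (stationary) assumptions on the design
matrices $X^{(k)}$, the matrix pair $(\tilde{S}, \tilde{B})$ is the unique solution
of the problem (1) if there exists a matrix $\tilde{Z} \in \mathbb{R}^{p \times r}$ such that

(C1) $P_{U_s}(\tilde{Z}) = \lambda_s\, \mathrm{sign}(\tilde{S})$.

(C2) $\big( P_{U_b}(\tilde{Z}) \big)_j^{(k)} = \begin{cases} t_j^{(k)}\, \mathrm{sign}\big( \tilde{b}_j^{(k)} \big) & k \in M_j(\tilde{B}) \\ 0 & \text{ow.,} \end{cases}$
where $t_j^{(k)} \ge 0$ such that $\sum_{k \in M_j(\tilde{B})} t_j^{(k)} = \lambda_b$.

(C3) $\big\| P_{U_s^c}(\tilde{Z}) \big\|_{\infty,\infty} < \lambda_s$.

(C4) $\big\| P_{U_b^c}(\tilde{Z}) \big\|_{\infty,1} < \lambda_b$.

(C5) $\frac{1}{n} \big\langle X^{(k)}, X^{(k)} \big\rangle \big( \tilde{b}^{(k)} + \tilde{s}^{(k)} \big) - \frac{1}{n} \big( X^{(k)} \big)^T y^{(k)} + \tilde{z}^{(k)} = 0 \quad \forall\, 1 \le k \le r$.

Proof: By assumptions (C1) and (C3), $\frac{1}{\lambda_s} \tilde{Z} \in \partial \|\tilde{S}\|_{1,1}$,
and by assumptions (C2) and (C4), $\frac{1}{\lambda_b} \tilde{Z} \in \partial \|\tilde{B}\|_{1,\infty}$. Thus,
$(\tilde{S}, \tilde{B}, \tilde{Z})$ is a feasible primal-dual pair of (1) according to
Lemma 13.

Let $\mathbb{B}$ and $\mathbb{S}$ be the balls of $\ell_\infty/\ell_1$ and $\ell_\infty/\ell_\infty$ with
radii $\lambda_b$ and $\lambda_s$, respectively. Considering the fact that
$\lambda_b \|B\|_{1,\infty} = \sup_{Z \in \mathbb{B}} \langle Z, B \rangle$ and $\lambda_s \|S\|_{1,1} = \sup_{Z \in \mathbb{S}} \langle Z, S \rangle$,
the problem (1) can be written as

$$(\hat{S}, \hat{B}) = \arg\inf_{S, B}\ \sup_{Z \in \mathbb{B} \cap \mathbb{S}} \Big\{ \frac{1}{2n} \sum_{k=1}^{r} \big\| y^{(k)} - X^{(k)} \big( b^{(k)} + s^{(k)} \big) \big\|_2^2 + \langle Z, S \rangle + \langle Z, B \rangle \Big\}.$$

This saddle-point problem is strictly feasible and convex-concave.
Given any dual variable, in particular $\tilde{Z}$, and any
primal optimal $(\hat{S}, \hat{B})$ we have $\lambda_b \|\hat{B}\|_{1,\infty} = \langle \tilde{Z}, \hat{B} \rangle$ and
$\lambda_s \|\hat{S}\|_{1,1} = \langle \tilde{Z}, \hat{S} \rangle$. This implies that $\hat{b}_j = 0$ if $\|\tilde{z}_j\|_1 < \lambda_b$
(because $\lambda_b \sum_j \|\hat{b}_j\|_\infty = \sum_j \langle \tilde{z}_j, \hat{b}_j \rangle \le \sum_j \|\tilde{z}_j\|_1 \|\hat{b}_j\|_\infty$, and if $\|\tilde{z}_{j_0}\|_1 < \lambda_b$
for some $j_0$, then others can not compensate for that in the sum
due to the fact that $\tilde{Z} \in \mathbb{B}$, i.e., $\|\tilde{z}_j\|_1 \le \lambda_b$). It also implies
that $\hat{s}_j^{(k)} = 0$ if $|\tilde{z}_j^{(k)}| < \lambda_s$ for a similar reason. Hence,
$P_{U_b^c}(\hat{B}) = 0$ and $P_{U_s^c}(\hat{S}) = 0$. This means that solving the
restricted problem (4) is equivalent to solving the problem (1).

The uniqueness follows from our (stationary) assumptions
on the design matrices $X^{(k)}$ that the matrix $\frac{1}{n} \big\langle X_{U_k}^{(k)}, X_{U_k}^{(k)} \big\rangle$ is
invertible for all $1 \le k \le r$. Using this assumption, the
problem (4) is strictly convex and the solution is unique.
Consequently, the solution of (1) is also unique, since we
showed that these two problems are equivalent. This concludes
the proof of the lemma. ■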
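
The two strict dual feasibility conditions, (C3) and (C4), are the ones that remain to be verified for each theorem, and they are simple to evaluate for a given candidate. The sketch below is illustrative only: the function name, the zero-based index convention and the example numbers are our own, and the dual matrix is stored as a list of rows.

```python
def strict_dual_feasibility(Z, supp_S_star, rowsupp_B_star, lam_s, lam_b):
    """Evaluate (C3) and (C4) for a candidate dual matrix Z (list of rows).

    (C3): the largest |z_j^(k)| outside Supp(S*) must be below lam_s.
    (C4): the largest row l_1 norm outside RowSupp(B*) must be below lam_b.
    """
    off_support = max((abs(z) for j, row in enumerate(Z) for k, z in enumerate(row)
                       if (j, k) not in supp_S_star), default=0.0)
    off_rows = max((sum(abs(z) for z in row) for j, row in enumerate(Z)
                    if j not in rowsupp_B_star), default=0.0)
    return off_support < lam_s, off_rows < lam_b

# Hypothetical numbers, chosen so that (C3) holds and (C4) fails.
Z = [[1.0, 0.5], [0.3, -0.4], [0.9, 0.9]]
c3_ok, c4_ok = strict_dual_feasibility(Z, supp_S_star={(0, 0)}, rowsupp_B_star={0},
                                       lam_s=1.0, lam_b=1.5)
```

Here the third row has $\ell_1$ norm 1.8, which exceeds $\lambda_b = 1.5$, so this candidate would not certify row-support recovery.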



By construction, the primal-dual pair $(\tilde{B}, \tilde{S}, \tilde{Z})$ satisfies the
(C1), (C2) and (C5) conditions in Lemma 2. It only remains to
guarantee (C3) and (C4) separately for each of the theorems.
Indeed, this is where the proofs of the theorems differ.
Specifically, Lemmas 3, 5 and 8 ensure these conditions are
satisfied with the sample complexities given in Theorems 1, 2 and
3, respectively.

VI. Proofs

The proofs of our three main theorems are in Sections VI-A,
VI-B and VI-C, respectively.



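A. Proof of Theorem 1

Let $d = \lfloor \lambda_b/\lambda_s \rfloor$ and $(B^*, S^*) = H_d(\bar{\Theta})$. Then, the result
follows from Proposition 1 below.

Proposition 1 (Structure Recovery). Under assumptions
of Theorem 1, with probability $1 - c_1 \exp(-c_2 n)$ for some
positive constants $c_1$ and $c_2$, we are guaranteed that the
following properties hold:

(P1) Problem (1) has unique solution $(\hat{S}, \hat{B})$ such
that $\mathrm{Supp}(\hat{S}) \subseteq \mathrm{Supp}(S^*)$ and $\mathrm{RowSupp}(\hat{B}) \subseteq \mathrm{RowSupp}(B^*)$.

(P2) $\big\| \hat{B} + \hat{S} - B^* - S^* \big\|_\infty \le \sqrt{\dfrac{4\sigma^2 \log(pr)}{C_{\min}\, n}} + \lambda_s D_{\max} =: b_{\min}$.

(P3) $\mathrm{sign}\big(\mathrm{Supp}(\hat{s}_j)\big) = \mathrm{sign}\big(\mathrm{Supp}(s_j^*)\big)$
for all $j \notin \mathrm{RowSupp}(B^*)$ provided that
$\min_{j \notin \mathrm{RowSupp}(B^*),\ (j,k) \in \mathrm{Supp}(S^*)} \big| s_j^{*(k)} \big| > b_{\min}$.

(P4) $\mathrm{sign}\big(\mathrm{Supp}(\hat{s}_j + \hat{b}_j)\big) = \mathrm{sign}\big(\mathrm{Supp}(s_j^* + b_j^*)\big)$
for all $j \in \mathrm{RowSupp}(B^*)$ provided that
$\min_{(j,k) \in \mathrm{Supp}(B^*)} \big| b_j^{*(k)} + s_j^{*(k)} \big| > b_{\min}$.

Proof: We prove the result separately for each part.

(P1) Considering the constructed primal-dual pair, it suffices
to show that (C3) and (C4) in Lemma 2 are satisfied
with high probability. By Lemma 3, with probability at
least $1 - c_1 \exp(-c_2 n)$ those two conditions hold and
hence $(\hat{S}, \hat{B}) = (\tilde{S}, \tilde{B})$ is the unique solution of (1) and
the property (P1) follows.

(P2) Using (5), we have

$$\max_{j \in U_k} \big| \Delta_j^{(k)} \big| \le \Big\| \Big( \tfrac{1}{n} \big\langle X_{U_k}^{(k)}, X_{U_k}^{(k)} \big\rangle \Big)^{-1} \tfrac{1}{n} \big( X_{U_k}^{(k)} \big)^T w^{(k)} \Big\|_\infty + \Big\| \Big( \tfrac{1}{n} \big\langle X_{U_k}^{(k)}, X_{U_k}^{(k)} \big\rangle \Big)^{-1} \tilde{z}_{U_k}^{(k)} \Big\|_\infty \le \sqrt{\frac{4\sigma^2 \log(pr)}{C_{\min}\, n}} + \lambda_s D_{\max},$$

where the second inequality holds with high probability
as a result of Lemma 4 for $\alpha = \epsilon \sqrt{\frac{4\sigma^2 \log(pr)}{C_{\min}\, n}}$ for some
$\epsilon > 1$, considering the fact that $\mathrm{Var}\big( \Delta_j^{(k)} \big) \le \frac{\sigma^2}{C_{\min}\, n}$.

(P3) Using (P1) in Lemma 11, this event is equivalent to
the event that for all $j \notin \mathrm{RowSupp}(B^*)$ with $(j,k) \in \mathrm{Supp}(S^*)$,
we have $\big( \Delta_j^{(k)} + s_j^{*(k)} \big)\, \mathrm{sign}\big( s_j^{*(k)} \big) > 0$. By
Hoeffding inequality, we have

$$P\Big[ \big( \Delta_j^{(k)} + s_j^{*(k)} \big)\, \mathrm{sign}\big( s_j^{*(k)} \big) > 0 \Big] \ge P\Big[ \big| \Delta_j^{(k)} \big| < \big| s_j^{*(k)} \big| \Big].$$

By part (P2), this event happens with high probability if
$\min_{j \notin \mathrm{RowSupp}(B^*),\ (j,k) \in \mathrm{Supp}(S^*)} \big| s_j^{*(k)} \big| > b_{\min}$.

(P4) Using (P1) in Lemma 11, this event is equivalent to
the event that for all $j \in \mathrm{RowSupp}(B^*)$, we have
$\big( \Delta_j^{(k)} + b_j^{*(k)} + s_j^{*(k)} \big)\, \mathrm{sign}\big( b_j^{*(k)} + s_j^{*(k)} \big) > 0$. By
Hoeffding inequality, we have

$$P\Big[ \big( \Delta_j^{(k)} + b_j^{*(k)} + s_j^{*(k)} \big)\, \mathrm{sign}\big( b_j^{*(k)} + s_j^{*(k)} \big) > 0 \Big] \ge P\Big[ \big| \Delta_j^{(k)} \big| < \big| b_j^{*(k)} + s_j^{*(k)} \big| \Big].$$

By part (P2), this event happens with high probability if
$\min_{(j,k) \in \mathrm{Supp}(B^*)} \big| b_j^{*(k)} + s_j^{*(k)} \big| > b_{\min}$. ■

Lemma 3. Under conditions of Proposition 1, the conditions
(C3) and (C4) in Lemma 2 hold for the constructed primal-dual
pair with probability at least $1 - c_1 \exp(-c_2 n)$ for some
positive constants $c_1$ and $c_2$.

Proof: First, we need to bound the projection of $\tilde{Z}$ into
the space $U_s^c$. Notice that

$$\Big| \big( P_{U_s^c}(\tilde{Z}) \big)_j^{(k)} \Big| = \begin{cases} \dfrac{\lambda_b - \lambda_s \|\tilde{s}_j\|_0}{|M_j(\tilde{B})| - \|\tilde{s}_j\|_0} & j \in \mathrm{RowSupp}(\tilde{B})\ \&\ (j,k) \notin \mathrm{Supp}(\tilde{S}) \\[4pt] \big| \tilde{z}_j^{(k)} \big| & j \in \bigcap_{k=1}^{r} U_k^c \\[4pt] 0 & \text{ow.} \end{cases}$$

By our assumption on the ratio of the penalty regularizer
coefficients, we have $\frac{\lambda_b - \lambda_s \|\tilde{s}_j\|_0}{|M_j(\tilde{B})| - \|\tilde{s}_j\|_0} < \lambda_s$. Moreover, using (6)
and our assumptions on the design matrices, for all $j \in \bigcap_{k=1}^{r} U_k^c$ we have

$$\big| \tilde{z}_j^{(k)} \big| \le (1 - \gamma_s) \lambda_s + (2 - \gamma_s) \max_{1 \le k \le r} \Big\| \frac{1}{n} \big( X^{(k)} \big)^T w^{(k)} \Big\|_\infty.$$

Thus, the event $\big\| P_{U_s^c}(\tilde{Z}) \big\|_{\infty,\infty} < \lambda_s$ is implied by
the event $\max_{1 \le k \le r} \big\| \frac{1}{n} ( X^{(k)} )^T w^{(k)} \big\|_\infty < \frac{\gamma_s}{2 - \gamma_s} \lambda_s$. By
Lemma 4, this event happens with probability at least
$1 - 2 \exp\Big( -\frac{\gamma_s^2 \lambda_s^2 n}{4 (2 - \gamma_s)^2 \sigma^2} + \log(pr) \Big)$. This probability goes to
1 if $\lambda_s > \frac{2 (2 - \gamma_s) \sigma \sqrt{\log(pr)}}{\gamma_s \sqrt{n}}$, as stated in the assumptions.

Next, we need to bound the projection of $\tilde{Z}$ into the space $U_b^c$.
Notice that

$$\sum_{k=1}^{r} \Big| \big( P_{U_b^c}(\tilde{Z}) \big)_j^{(k)} \Big| = \begin{cases} \lambda_s \|\tilde{s}_j\|_0 & j \in \bigcup_{k=1}^{r} U_k - \mathrm{RowSupp}(B^*) \\[4pt] \sum_{k=1}^{r} \big| \tilde{z}_j^{(k)} \big| & j \in \bigcap_{k=1}^{r} U_k^c \\[4pt] 0 & \text{ow.} \end{cases}$$

We have $\lambda_s \|\tilde{s}_j\|_0 \le \lambda_s D(S^*) < \lambda_b$ by our assumption on the
ratio of the penalty regularizer coefficients. We can establish
the following bound for all $j \in \bigcap_{k=1}^{r} U_k^c$:

$$\sum_{k=1}^{r} \big| \tilde{z}_j^{(k)} \big| \le (1 - \gamma_b) \lambda_b + (2 - \gamma_b) \max_{1 \le k \le r} \Big\| \frac{1}{n} \big( X^{(k)} \big)^T w^{(k)} \Big\|_\infty.$$

Thus, the event $\big\| P_{U_b^c}(\tilde{Z}) \big\|_{\infty,1} < \lambda_b$ is implied by
the event $\max_{1 \le k \le r} \big\| \frac{1}{n} ( X^{(k)} )^T w^{(k)} \big\|_\infty < \frac{\gamma_b}{2 - \gamma_b} \lambda_b$. By
Lemma 4, this event happens with probability at least
$1 - 2 \exp\Big( -\frac{\gamma_b^2 \lambda_b^2 n}{4 (2 - \gamma_b)^2 \sigma^2} + \log(pr) \Big)$. This probability goes to
1 if $\lambda_b > \frac{2 (2 - \gamma_b) \sigma \sqrt{\log(pr)}}{\gamma_b \sqrt{n}}$, as stated in the assumptions.

Hence, with probability at least $1 - c_1 \exp(-c_2 n)$ conditions
(C3) and (C4) in Lemma 2 are satisfied. ■

Lemma 4.

$$P\Big[ \max_{1 \le k \le r} \Big\| \frac{1}{n} \big( X^{(k)} \big)^T w^{(k)} \Big\|_\infty < \alpha \Big] \ge 1 - 2 \exp\Big( -\frac{\alpha^2 n}{4\sigma^2} + \log(pr) \Big).$$

Proof: Since the $w_j^{(k)}$'s are distributed as $\mathcal{N}(0, \sigma^2)$, we
have $\frac{1}{n} ( X^{(k)} )^T w^{(k)}$ distributed as $\mathcal{N}\big( 0, \frac{\sigma^2}{n^2} ( X^{(k)} )^T X^{(k)} \big)$.
Using Hoeffding inequality, we have

$$P\Big[ \Big\| \frac{1}{n} \big( X^{(k)} \big)^T w^{(k)} \Big\|_\infty > \alpha \Big] \le \sum_{j=1}^{p} P\Big[ \Big| \frac{1}{n} \big( X_j^{(k)} \big)^T w^{(k)} \Big| > \alpha \Big] \le \sum_{j=1}^{p} 2 \exp\Big( -\frac{\alpha^2 n^2}{2 \sigma^2 ( X_j^{(k)} )^T X_j^{(k)}} \Big) \le 2p \exp\Big( -\frac{\alpha^2 n}{4\sigma^2} \Big),$$

where the last step uses $\| X_j^{(k)} \|_2^2 \le 2n$. By union bound, the result follows. ■

B. Proof of Theorem 2

Let $d = \lfloor \lambda_b/\lambda_s \rfloor$ and $(B^*, S^*) = H_d(\bar{\Theta})$. Then, the result
follows from the next proposition.

Proposition 2. Under assumptions of Theorem 2, if

$$n > \max\Big( \frac{B s \log(pr)}{C_{\min} \gamma_s^2},\ \frac{B s r \big( r \log(2) + \log(p) \big)}{C_{\min} \gamma_b^2} \Big),$$

then with probability at least
$1 - c_1 \exp\big( -c_2 ( r \log(2) + \log(p) ) \big) - c_3 \exp\big( -c_4 \log(rs) \big)$
for some positive constants $c_1$ to $c_4$, we are guaranteed that
the following properties hold:

(P1) The solution $(\hat{B}, \hat{S})$ to (1) is unique and $\mathrm{RowSupp}(\hat{B}) \subseteq \mathrm{RowSupp}(B^*)$
and $\mathrm{Supp}(\hat{S}) \subseteq \mathrm{Supp}(S^*)$.

(P2) $\big\| \hat{B} + \hat{S} - B^* - S^* \big\|_\infty \le \sqrt{\dfrac{50 \sigma^2 \log(rs)}{n\, C_{\min}}} + \lambda_s \Big( \dfrac{4s}{C_{\min} \sqrt{n}} + D_{\max} \Big) =: g_{\min}$.
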



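(P3) $\mathrm{sign}\big(\mathrm{Supp}(\hat{s}_j)\big) = \mathrm{sign}\big(\mathrm{Supp}(s_j^*)\big)$
for all $j \notin \mathrm{RowSupp}(B^*)$ provided that
$\min_{j \notin \mathrm{RowSupp}(B^*),\ (j,k) \in \mathrm{Supp}(S^*)} \big| s_j^{*(k)} \big| > g_{\min}$.

(P4) $\mathrm{sign}\big(\mathrm{Supp}(\hat{s}_j + \hat{b}_j)\big) = \mathrm{sign}\big(\mathrm{Supp}(s_j^* + b_j^*)\big)$
for all $j \in \mathrm{RowSupp}(B^*)$ provided that
$\min_{(j,k) \in \mathrm{Supp}(B^*)} \big| b_j^{*(k)} + s_j^{*(k)} \big| > g_{\min}$.
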

Proof: We provide the proof of each part separately.
(P1) Considering the constructed primal-dual pair $(\tilde{S}, \tilde{B}, \tilde{Z})$,
it suffices to show that the conditions (C3) and (C4)
in Lemma 2 are satisfied under these assumptions.
Lemma 5 guarantees that with probability at least
$1 - c_1 \exp\big( -c_2 ( r \log(2) + \log(p) ) \big)$ those conditions
are satisfied. Hence, $(\hat{B}, \hat{S}) = (\tilde{B}, \tilde{S})$ is the unique
solution to (1) and (P1) follows.



(P2) From Q, we have 



max A<'°' < 



l/^(fc) ^(k)\\ '1 



X-, 



('=)A^,„{fc) 



-(fe) 



(fc) „(fc)^ 



- S 



-(fc) 



(fe) 



;(fc) 



We need to bound these three quantities. Notice that 



~{k) 



< 



Ak) 



00,1 
Also, we have



"fc II2 



^ \ ^fc ' ^k / 



- s 



.(fc) ^ 



Cmin V ^ 

where the last inequality holds with probability at least
$1 - c_1 \exp\big( -c_2 ( \sqrt{n} - \sqrt{s} )^2 \big)$ for some positive constants
$c_1$ and $c_2$ as a result of [6] on eigenvalues of
Gaussian random matrices. Conditioned on $X_{U_k}^{(k)}$, the
vector $\mathcal{W}^{(k)} \in \mathbb{R}^{|U_k|}$ is a zero-mean Gaussian random
vector with covariance matrix $\frac{\sigma^2}{n} \Big( \frac{1}{n} \big\langle X_{U_k}^{(k)}, X_{U_k}^{(k)} \big\rangle \Big)^{-1}$.
Thus, we have



^max 



^ ^max 

n 



(k) ^ 



5 



From the concentration of Gaussian random variables 
(Lemma lU and using the union bound, we get 



max W 

l<fc<r 11 



(fe) 



> t 



< 2exp 



-I- log (rs) 



Fort 



50cT^ log(rs) 



for some $\epsilon > 1$, the result follows.

(P3), (P4) The results are immediate consequences of (P2). ■



Lemma 4 in flT]), we have 



< max 
:>eriUi^k 



J ' ri ^k \ ri \ ^k' l^k 



X 



{k)\ 



+ max 



< max IVV*''^ I 



max 



^jMk y^u^Mk 



I -(k)\ 



max 



<{l-7s)As+ max |7^*.'''|+ max Iw**^!, 

The second inequality follows from the triangle inequality on
the distributions. By Lemma 6, if $n > \frac{2}{2 - \sqrt{3}} \log(pr)$ then with
high probability $\| X_j^{(k)} \|_2^2 < 2n$ and hence $\mathrm{Var}\big( \mathcal{W}_j^{(k)} \big) \le \frac{2\sigma^2}{n}$.
Using the concentration results for the zero-mean Gaussian
random variable $\mathcal{W}_j^{(k)}$ and using the union bound, we get



max > t 



< 2exp ( +log(p) 



Conditioning on (^xjj^ ,w^''\ z'^'^^^'s, we have that TZ 
zero-mean Gaussian random variable with 

2 



yt > 0. 

(k) ■ 

^ ' IS a 



Var 



< 



Ak) 



By concentration of Gaussian random variables, we have 



< 



max l7^ > t 



< 2exp 



u^nui^k 

Using these bounds, we get 



BsA? 



+ log(p) \/t>0. 



Lemma 5. Under the assumptions of Proposition 2, the
conditions (C3) and (C4) in Lemma 2 hold for the constructed
primal-dual pair with probability at least
$1 - c_1 \exp\big( -c_2 ( r \log(2) + \log(p) ) \big)$ for some positive constants
$c_1$ and $c_2$.

Proof: First, we need to bound the projection of $\tilde{Z}$ into
the space $U_s^c$. Notice that



A(, - Aspjilo 
|m,(S)|-||S,||o 

j e RowSupp(B) & (j, k) ^ Supp(S) 

i 6 h 



-(fe) 



fc=l 
ow. 



By our assumptions on the ratio of the penalty regularizer co- 



A,,-A, 



efficients, we have , , , 

|M,(B)|-||s,||o 

and R E M^^'' with i.i.d. standard Gaussian entries (see 



P[/c(Z) <A, 

I ^ 1 1 00 ,00 



> p 


max 










> p 


max 











> 1 - 2 exp 



max W^''n<7sAs Vl<fc<r 



max Ivvf 'I < 7sAs - io VI < A: < r 
'-log(pr) 



1 - 2 cxp 



(7sAs - toY^n 
4a2 



This probability goes to 1 for to 



- log(pr) 



/Bs\, 



=7sAs (the 



t c 

solution to —, ' 



BsXi 



^^''40^*°"' if the regularization param- 



'-fP^ < Xs. For all j G HLi ^k eter A, > 



Jia^Cminlogipr) • J J 4.1, » ^ Bslog(pr) 

provided that n > ^ '^^^.^ ' 



73 y'nC^in -^Bs log(pr) 

as stated in the assumptions. 
Next, we need to bound the projection of $\tilde{Z}$ into the space $U_b^c$.
Notice that



E 



~ \ (fc) 



El 

fc=i 





i 6 U Wfe -RowSupp(B*) 

e n "fc 



and consequently by concentration of Gaussian variables, 



max V l7^f'| > t 



max max \^ v^R}^^ > 



We have $\lambda_s \|\tilde{s}_j\|_0 \le \lambda_s D(S^*) < \lambda_b$ by our assumption on
the ratio of the penalty regularizer coefficients. For all
$j \in \bigcap_{k=1}^{r} U_k^c$, we have



E 



< 2exp 
Finally, we have 

\\Pu^A^)\\ o-b 

" *■ lloc.l 

ma; 



2rs\l 



+ rlog(2) + log(p) Vt>0 



< max y 



X-, 



max y + max V wf < 76^6 

r 

max 7^^ 



< to 



+ max y 

r 

' max >^ W 



max \^ 



(fc) 



1 



max 



max \^ 



1 

.(fc)l 



-(fc) 



< (1 ^ 7b)-^6 + max 7?.' 



(fc) 



max W 



(fc) 



Let V G { — 1,+!}'' be a vector of signs such that 



(k) 



ELi^feWf -Then, 



max y;|wf' I <76A6-to 
> ( l-2cxp|-^|^+rlog(2) + log(p) 

1 _ 2 exp (Jj^h^Z^ + , log(2) + log{p) 



4(T2r 



This probability goes to 1 for to — ,„ , (the 



solution to 



(7b-^6 — *o) " _ t^nCmi 



-), if 



A), > 



V 


4(7^ Cmin'^ 


(rlog(2) +log(p)^ 








;^rlog{2) + log{p)) 



asr(rlog(2)+lQg(p)) 



as stated in 



provided that n > 

the assumptions. Hence, with probability at least
$1 - c_1 \exp\big( -c_2 ( r \log(2) + \log(p) ) \big)$ the conditions of
Lemma 2 are satisfied. ■



VarK^lwf =VarK]z;,.wf < 



\fc=i / \fe=i / 

Using the union bound and previous discussion, we get 



max V W^''^ 



> t 



max 



X max WfcW,-'^'' 



> t 



< 2 exp 
We have 



+ rlog(2) +log(p) yt>0 



(fc) 



rsA,, 



< 



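Lemma 6.

$$P\Big[ \max_{1 \le k \le r}\ \max_{1 \le j \le p} \big\| X_j^{(k)} \big\|_2^2 < 2n \Big] \ge 1 - \exp\Big( -\Big( 1 - \frac{\sqrt{3}}{2} \Big) n + \log(pr) \Big).$$

Proof: Notice that $\| X_j^{(k)} \|_2^2$ is a $\chi^2$ random variable with
$n$ degrees of freedom. According to [8], we have

$$P\Big[ \big\| X_j^{(k)} \big\|_2^2 \ge n + 2\sqrt{nt} + 2t \Big] \le \exp(-t) \qquad \forall t \ge 0.$$

Letting $t = \big( 1 - \frac{\sqrt{3}}{2} \big) n$ and using the union bound, the result
follows. ■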
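
The constant in Lemma 6 can be checked numerically. Assuming the chi-square tail bound of [8] in the form $P[\chi_n^2 \ge n + 2\sqrt{nt} + 2t] \le e^{-t}$, the choice $t = (1 - \sqrt{3}/2)\,n$ is exactly the one for which the threshold $n + 2\sqrt{nt} + 2t$ equals $2n$. The sketch below verifies this and evaluates the resulting union bound; the function names and the values of $n$, $p$, $r$ are illustrative only.

```python
import math

C = 1.0 - math.sqrt(3.0) / 2.0  # the constant (2 - sqrt(3)) / 2, about 0.134

def chi_square_threshold(n, t):
    """Threshold n + 2*sqrt(n*t) + 2*t appearing in the chi-square tail bound."""
    return n + 2.0 * math.sqrt(n * t) + 2.0 * t

def column_norm_failure_bound(n, p, r):
    """Union bound exp(-C*n + log(p*r)) on P[some squared column norm >= 2n]."""
    return math.exp(-C * n + math.log(p * r))

n, p, r = 1000, 500, 10
threshold = chi_square_threshold(n, C * n)      # equals 2n up to rounding
failure = column_norm_failure_bound(n, p, r)    # small once n exceeds log(pr) / C
min_n = math.log(p * r) / C                     # the bound is below 1 only for n above this
```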

C. Proof of Theorem 3 

We will actually prove a more general theorem, from which
Theorem 3 would follow as a corollary. Among shared features
(with size $\alpha s$), we say a fraction $\tau$ has different magnitudes
on $\bar{\Theta}$. Let $\tau_1$ be the fraction with larger magnitude on the first
task and $\tau_2$ the fraction with larger magnitude on the second
task (so that $\tau = \tau_1 + \tau_2$). Moreover, let $\lambda_b/\lambda_s = \kappa$ and

$$f(\kappa) = f(\kappa, \tau, \alpha) = 2 - 2(1 - \tau)\alpha - 2\tau\alpha\kappa + \Big( \tau + \frac{1 - \tau}{2} \Big) \alpha \kappa^2$$

and

$$g(\kappa, \tau, \alpha) = \max\Big( \frac{2 f(\kappa)}{\kappa^2},\ f(\kappa) \Big).$$
Theorem 4. Under the assumptions of the Theorem 3, if 



||j e RowSupp(B') : I 



(1)1 



i|| = (1 - T)as, 



then the result of Theorem 3 holds for

$$\theta(n, s, p, \alpha) = \frac{n}{g(\kappa, \tau, \alpha)\, s \log\big( p - (2 - \alpha)s \big)}.$$

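Corollary 4. Under the assumptions of Theorem 4, if
the regularization penalties are set as $\kappa = \lambda_b/\lambda_s = \sqrt{2}$,
then the result of Theorem 3 holds for

$$\theta(n, s, p, \alpha) = \frac{n}{\big( 2 - \alpha + (3 - 2\sqrt{2}) \tau\alpha \big)\, s \log\big( p - (2 - \alpha)s \big)}.$$

Proof: Follows trivially by substituting $\kappa = \sqrt{2}$ in
Theorem 4. Indeed, this setting of $\kappa$ can also be shown to
minimize $g(\kappa, \tau, \alpha)$:

$$\min_{1 < \kappa < 2} \max\Big( \frac{2 f(\kappa)}{\kappa^2},\ f(\kappa) \Big) = \min\Big( \min_{1 < \kappa \le \sqrt{2}} \frac{2}{\kappa^2} f(\kappa),\ \min_{\sqrt{2} \le \kappa < 2} f(\kappa) \Big) = 2 - \alpha + (3 - 2\sqrt{2}) \tau\alpha.$$

■

Proof of Theorem 3: The proof follows from Corollary 4
by setting $\tau = 0$ and $\kappa = \sqrt{2}$.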
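
The claim that $\kappa = \sqrt{2}$ minimizes $g$ can be checked numerically. The sketch below assumes $f(\kappa, \tau, \alpha) = 2 - 2(1-\tau)\alpha - 2\tau\alpha\kappa + (\tau + \frac{1-\tau}{2})\alpha\kappa^2$, which is the form consistent with the value $2 - \alpha + (3 - 2\sqrt{2})\tau\alpha$ stated in Corollary 4, together with $g = \max(2f/\kappa^2, f)$. The grid resolution and the values of $\tau$ and $\alpha$ are arbitrary choices for illustration.

```python
import math

def f(kappa, tau, alpha):
    """Assumed form of f(kappa, tau, alpha); see the lead-in for the justification."""
    return (2.0 - 2.0 * (1.0 - tau) * alpha - 2.0 * tau * alpha * kappa
            + (tau + (1.0 - tau) / 2.0) * alpha * kappa ** 2)

def g(kappa, tau, alpha):
    """g = max(2 f / kappa^2, f), the factor multiplying s log(p - (2 - alpha) s)."""
    value = f(kappa, tau, alpha)
    return max(2.0 * value / kappa ** 2, value)

tau, alpha = 0.3, 0.5
closed_form = 2.0 - alpha + (3.0 - 2.0 * math.sqrt(2.0)) * tau * alpha

# Scan kappa over the open interval (1, 2) and locate the minimizer of g.
grid = [1.0 + i / 10000.0 for i in range(1, 10000)]
best_kappa = min(grid, key=lambda k: g(k, tau, alpha))
best_value = g(best_kappa, tau, alpha)
```

On this grid the minimizer sits next to $\sqrt{2} \approx 1.4142$ and the minimum agrees with the closed form to within the grid resolution.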

We will now set out to prove Theorem 4. We will first need 
the following lemma. 



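Lemma 7. For any $j \in \mathrm{RowSupp}(B^*)$, if
$\Big| \big| \bar{\theta}_j^{(1)} \big| - \big| \bar{\theta}_j^{(2)} \big| \Big| \le c\lambda_s$ for
some constant $c$ specified in the proof, then $\tilde{s}_j = 0$ with
probability $1 - c_1 \exp(-c_2 n)$.

Proof: Let $\bar{S}$ be a matrix equal to $\tilde{S}$ except that $\bar{s}_j^{(k)} = 0$.
Using the concentration of Gaussian random variables and
optimality of $\tilde{S}$, we get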

p [[sj'^'l > 

2nXs |5f> I < - X('=)(J3('=' + 5('=')|[ 

_|L{'=)_xW(bW + 5W)| 



y{k) _ x(*)(_B('=) + St*)) 

< ' ' ||gw^w|| ' 

II J J II 2 



J II2 



II J J II 2 
2nXs < 2 llxf ) f LC^) - X('=)(B('=) + sC^')!! 



\\X 



nXs < \\X 



ll^(fc) (^.(fc) _^ g.Cfe) _ ^(fc) _ g(fe)) ^ ^(fc) I 



Using the $\ell_\infty$ bound on the error, for some constant $c$, we
have



5f ^1 > o| < : 



-^n<\\xfn 



Notice thatEfllX 



(fc)|l2l 



^ _ n. According to the concentration 
of random variables concentration theorems (see (Stl), this 
probability vanishes exponentially fast in n for 



<cXs. 



D. Proof of Theorem 4 

We will now provide the proofs of the different parts separately.

Proof: (Success): Recall the constructed primal-dual
pair $(\tilde{B}, \tilde{S}, \tilde{Z})$. It suffices to show that the dual variable
$\tilde{Z}$ satisfies the conditions (C3) and (C4) of Lemma 2. By
Lemma 8, these conditions are satisfied with probability at
least $1 - c_1 \exp(-c_2 n)$ for some positive constants $c_1$ and $c_2$.
Hence, $(\hat{B}, \hat{S}) = (\tilde{B}, \tilde{S})$ is the unique optimal solution. The
rest are direct consequences of Proposition 2 for $C_{\min} = 1$.



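(Failure): We prove this result by contradiction. Suppose
there exists a solution to (1), say $(\hat{B}, \hat{S})$, such
that $\mathrm{sign}\big( \mathrm{Supp}(\hat{B} + \hat{S}) \big) = \mathrm{sign}\big( \mathrm{Supp}(B^* + S^*) \big)$. By
Lemma 11, this is equivalent to having $\mathrm{sign}\big( \mathrm{Supp}(\hat{B}) \big) = \mathrm{sign}\big( \mathrm{Supp}(B^*) \big)$
and $\mathrm{sign}\big( \mathrm{Supp}(\hat{S}) \big) = \mathrm{sign}\big( \mathrm{Supp}(S^*) \big)$ and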

Xb 

a7 - ^• 

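Now, suppose $n < (1 - \nu) \max\big( \frac{2 f(\kappa)}{\kappa^2}, f(\kappa) \big)\, s \log\big( p - (2 - \alpha)s \big)$,
for some $\nu > 0$. This entails that

either (i) $n < (1 - \nu) f(\kappa)\, s \log\big( p - (2 - \alpha)s \big)$,
or (ii) $n < (1 - \nu) \frac{2 f(\kappa)}{\kappa^2}\, s \log\big( p - (2 - \alpha)s \big)$.

Case (i): We will show that with high probability, there
exists $k$ for which there exists $j \in \bigcap_{k=1}^{r} U_k^c$ such that
$\big| \tilde{z}_j^{(k)} \big| > \lambda_s$. This is a contradiction to Lemma 13.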

Using (6) and conditioning on $\big( X_{U_k}^{(k)}, w^{(k)}, \tilde{z}_{U_k}^{(k)} \big)$, for all
$j \in \bigcap_{k=1}^{r} U_k^c$ we have that the random variables $\tilde{z}_j^{(k)}$ are
i.i.d. zero-mean Gaussian random variables with



Var ( Z 



-7{k)-\ 



- I - -X 



(fe) 



7ik) 



^ \ Uu ^ Ui, 



X-, 



(fc)V 



„(fe) 



X-, 



(k)\ 



15 



The second equality holds by orthogonality of projections. We 
thus have 



> max 



1 /^(fc) 



=(fe) 



The second inequality holds with probability at least
$1 - c_1 \exp\big( -c_2 ( \sqrt{n} + \sqrt{s} )^2 \big)$ as a result of [6] on the eigenvalues
of Gaussian matrices. The third inequality holds with
probability at least $1 - c_3 \exp(-c_4 n)$ as a result of [8] on
the magnitude of $\chi^2$ random variables. Considering $\hat{B} + \hat{S}$,
assume that among shared features (with size $\alpha s$), a portion
$\tau_1$ has larger magnitude on the first task and a portion
$\tau_2$ has larger magnitude on the second task (and consequently
a portion $1 - \tau_1 - \tau_2$ has equal magnitude on both tasks).
Assuming $\lambda_b = \kappa\lambda_s$ for some $\kappa \in (1, 2)$, we get

al :=Var(z(^) 



(1 - a)sX^ + nasA^ + T2as{Xi, - X^)^ + (1 - ri - T2)os^ 



The first equality follows from the construction of the dual
matrix and the fact that we have recovered the sign support
correctly. The last strict inequality follows from the assumption
that $\theta(n, p, s, \alpha) < 1$. Similarly, we have

^ (1 - a)sA2 + r2asA2 + Tias{Xt - Xs)^ + (1 - ri - T2)os^ 



^ f2{K)sX^s 

Given these lower bounds on the variance, by results on 
Gaussian maxima (see li6i]), for any 5 > 0, with high proba- 
bility. 



max max 



> (1 - 6)^{aj+^^,)log (r(p-(2-a).)). 



This in turn can be bound as 

(1 - 5) {af + 5|) log (r(p - (2 - 
>{l-5) 



(/l(«)+/2(«)) 


s log 


H 


P- (2-a)s)) 


n 




IS 

n 





> (1 - 5) 
Consider two cases 



/(k) s log (^r(^p - (2 - a)s)) 



n [1 + 



= (1-5)- 



1) ^ = ^(1)- In this case, we have s > cn for some 
constant c > 0. Then, 

(/(^)) s log (r(p- (2 -a)s)) ^ 

{f{K)) {s/n) log (r (p- (2- a)s)) ^ 

1 - S) s 

>c'fit,) log (r(p-(2-Q)s)) A^ 
> (l + e)A?, 

for any fixed e > 0, as p — >^ oo. 

2) ^ — > 0: In this case, we have s/n = o(l). Here 
we will use that the sample size scales as n < (1 — 

(/(«)) slog(p- (2 ~a)s). 



(/(^)) s log fr(p-{2-a)s)] 
(1 - S) ^-^ 2 ^. 

^ (l-5)(l-o(l)) ^2 



1 - 
2 



> (i + e)A: 



for some e > by taking 5 small enough. 



Thus with high probability, $\exists k\ \exists j \in \bigcap_{k=1}^{r} U_k^c$ such that
$\big| \tilde{z}_j^{(k)} \big| > \lambda_s$. This is a contradiction to Lemma 13.

Case (ii): We need to show that with high probability,
there exists a row that violates the sub-gradient condition of
the $\ell_\infty$-norm: $\exists j \in \bigcap_{k=1}^{r} U_k^c$ such that
$\| \tilde{z}_j \|_1 > \lambda_b$. This is a
contradiction to Lemma 13.



Following the same proof technique, notice that 
X]fe=i ^j-'^^ is ^ zero-mean Gaussian random variable 



with Var(^;^^. 
probability 



> r{al + o"!). Thus, with high 



max 



>{l-5)Jr{al + al)\og{p-{2^a) 



Following the same line of argument for this case yields the
required bound $\| \tilde{z}_j \|_1 > (1 + \epsilon) \lambda_b$.

This concludes the proof of the theorem. ■



Lemma 8. Under assumptions of Theorem 3, the conditions
(C3) and (C4) in Lemma 2 hold with probability at least
$1 - c_1 \exp(-c_2 n)$ for some positive constants $c_1$ and $c_2$.

Proof: First, we need to bound the projection of $\tilde{Z}$ into
the space $U_s^c$. Notice that

( Af,- A.llSjIlo 



i 



|Af,(B)| -IIS.IIo 

j e RowSupp(_B) & (j, k) i Supp{S) 

fc=i 

ow. 

By our assumption on the penahy regularizer coefficients, we 
have < ^s- Moreover, we have 



-7(k)\ 



< max 



max 



7(fe) 



max LE 



(fc)| 



max W' 



By Lemma|6] \fn> log(p-ftr) then with high probabihty 



2 



(fc) 



E 



< 2n and hence Vai- (^wj''^^ < Notice that 



(fe) 



— n and we added the factor of 2 arbitrarily 



to use the concentration theorems. Using the concentration 

(k) 

resuhs for the zero-mean Gaussian random variable and 
using the union bound, for all i > 0, we get 



max W*'°' > t 



<2exp(-— +log(p-(2-a)s) ). 



Conditioning on [Xy\w^''\ Z^'^M's, we have that zf^ is a 



zero-mean Gaussian random variable with 



Var Z. 



?(fe) 



n 



(fc) 



According to the result of on singular values of Gaussian 
matrices, for the matrix Xj^^, for all 5 > 0, we have 

P K»™ «) < (1 - 5) (v^- v^)] < exp p'^^;^^' ) , 



and since A„ 



(l + '5) 



, we get 



1 



< exp 



(V5TT-l)'(v^-v^)' 



2(1 + .5) 



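According to Lemma 7, if $\Big| \big| \bar{\theta}_j^{(1)} \big| - \big| \bar{\theta}_j^{(2)} \big| \Big| = o(\lambda_s)$,
then with high probability $\tilde{s}_j = 0$, so that $\big| \hat{\theta}_j^{(1)} \big| = \big| \hat{\theta}_j^{(2)} \big|$.
Thus, among shared features (with size $\alpha s$), a fraction $\tau$ have
differing magnitudes on $\hat{B} + \hat{S}$. Let $\tau_1$ be the fraction with larger
magnitude on the first task and $\tau_2$ the fraction with larger
magnitude on the second task (so that $\tau = \tau_1 + \tau_2$). Then, with
high probability, recalling that $\lambda_b = \kappa\lambda_s$ for some $1 < \kappa < 2$,
we get

$$\big\| \tilde{z}_{U_1}^{(1)} \big\|_2^2 = (1 - \alpha) s \lambda_s^2 + \tau_1 \alpha s \lambda_s^2 + \tau_2 \alpha s (\lambda_b - \lambda_s)^2 + (1 - \tau_1 - \tau_2) \alpha s \frac{\lambda_b^2}{4}$$
$$= \Big( 1 - (1 - \tau_1 - \tau_2)\alpha - 2\tau_2 \alpha\kappa + \Big( \tau_2 + \frac{1 - \tau_1 - \tau_2}{4} \Big) \alpha\kappa^2 \Big) s \lambda_s^2 =: f_1(\kappa)\, s \lambda_s^2.$$

Similarly,

$$\big\| \tilde{z}_{U_2}^{(2)} \big\|_2^2 = \Big( 1 - (1 - \tau_1 - \tau_2)\alpha - 2\tau_1 \alpha\kappa + \Big( \tau_1 + \frac{1 - \tau_1 - \tau_2}{4} \Big) \alpha\kappa^2 \Big) s \lambda_s^2 =: f_2(\kappa)\, s \lambda_s^2.$$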

By concentration of Gaussian random variables, we have 



max Le'*"' > t 

<2cxp( ^ ^ +log (p- (1 - a)s) Vt>0. 



Using these bounds, we get 



c/o(Z) <A, 
1 1 00 , 00 


max 




max 





max iwf < As VI < fc < i^: 



max \W^''H<Xs-to Wl<k<r 



> 1 - 2 exp 



'(/l(«)+/2(«))*A2 
>2 



+ log (p - (2 - a)s) + log(r) 



1 - 2 exp ^^2°^ " + log (p - (2 - + log(r) 

This probability goes to 1 for 

V(/iW + /2(«))«sA, 



to ^ 



fthe solution to *°(^^-^^)' - (^e-*o)^« . 
(.me soiuLion lo (/i(„)+/2(k))sAJ ~ 4<j^ " 



As > 



V 






2 


[^log(r) + log (p-(2-a)s)) 








/2W)s 


;^log(r) +log(p- (2-a)s))j 



provided that (substituting r = 2), 

n > (/i (k) + /2(k)) s log (p - (2 - 
+ (l + (/i(«) + /2(«;))log(2) 



+ 2^ + /2(k)) (log(2) + log (p - (2 - a)s) ) 
Since $f_1(\kappa) + f_2(\kappa) = f(\kappa)$ by definition, for large enough $p$
with $s/n = o(1)$, we require

$$n > f(\kappa)\, s \log\big( p - (2 - \alpha)s \big). \qquad (7)$$

and consequently for all $t > 0$,



max y > t 



t2 (v^-^/i)" 



Next, we need to bound the projection of Z into the space U^. 
Notice that 



< 2 exp - — 



- 1 y2 + '^losCS) + log P - (2 - a)s) . 



E 



As||5j||o 



fc=i 




(fc)| 



j e U Wfc -RowSupp(B*) 

fc=i 

r 

i 6 n 



Finally, we have 

||Puc{Z)|| <Ai, 



We have $\lambda_s \|\tilde{s}_j\|_0 \le \lambda_s D(S^*) < \lambda_b$ by our assumption on the
ratio of penalty regularizer coefficients. For all $j \in \bigcap_{k=1}^{r} U_k^c$,
we have



(fc) 



< max >^ 



max >^ 



7(fc) 



(fc) 



(fc) 



Let V G { — be a vector of signs such that 

' ^rl^ELi^fcWf-Thus, 

Var(tj<l)-Var(x:..Wf))<^. 

Using the union bound and previous discussion, for all t > Q, 
we get 



max y In'*'"'! > 



max 



jc max y VkW^ ' > t 

2 exp / -—^ +rlog{2) +log (p - (2 - a)s) j 



max 



< to 



max Iw'^'l < Afc - to 



> 1 - 2 exp 



- 1 u i^f ,f ^^ ,2 + ^' + log (p - (2 - a).) 



1 - 2 exp - 



■ + r log(2) + log (p - (2 - a)s) 



This probability goes to 1 for 



to 



A/7^(/i{«) + /2W)nsA6 + 2a(v^ - v^) 

(the solution to = ^^(v^_v^)l^)^ jf 



Ab 



Afc > 



V 






\ 


[^rlog(2 


+ log(p- (2-a)s)) 










;^rlog(2) + log (p-(2-Q)s))j 



provided that (substituting r — 2), 

n> ^(/i(K) + /2(K))slog(p-(2-Q)s) 



+ l+^(/i(«;) + /2(«))21og(2) 



+ 2^-^ + /2(k)) (2 log(2) + log (p - (2 - a)s))y 



For large enough $p$ with $s/n = o(1)$, we require

$$n > \frac{2}{\kappa^2} f(\kappa)\, s \log\big( p - (2 - \alpha)s \big).$$

Combining this result with (7), the lemma follows. ■



Also from the previous analysis, assuming $\lambda_b = \kappa\lambda_s$ for some
$1 < \kappa < 2$, we get



2(1 - a)sXl + (n + T2)asA2 + (n + T2)as(A6 - Xsf + 2(1 - • 



T2)as^ 



7^(/i(«) + /2(K))sAg 






References 

[1] A. Asuncion and D. J. Newman. UCI Machine Learning Repository, http://www.ics.uci.edu/mlearn/MLRepository.html. University of California, School of Information and Computer Science, Irvine, CA, 2007.

[2] F. Bach. Consistency of the group lasso and multiple kernel learning. Journal of Machine Learning Research, 9:1179-1225, 2008.

[3] R. Baraniuk. Compressive sensing. IEEE Signal Processing Magazine, 24(4):118-121, 2007.

[4] R. Caruana. Multitask learning. Machine Learning, 28:41-75, 1997.

[5] C. Zhang and J. Huang. Model selection consistency of the lasso selection in high-dimensional linear regression. Annals of Statistics, 36:1567-1594, 2008.

[6] K. R. Davidson and S. J. Szarek. Local operator theory, random matrices and Banach spaces. In Handbook of Banach Spaces, Elsevier, Amsterdam, NL, volume 1, pages 317-336, 2001.

[7] X. He and P. Niyogi. Locality preserving projections. In NIPS, 2003.

[8] B. Laurent and P. Massart. Adaptive estimation of a quadratic functional by model selection. Annals of Statistics, 28:1303-1338, 1998.

[9] H. Liu, M. Palatucci, and J. Zhang. Blockwise coordinate descent procedures for the multi-task lasso, with applications to neural semantic basis discovery. In 26th International Conference on Machine Learning (ICML), 2009.

[10] K. Lounici, A. B. Tsybakov, M. Pontil, and S. A. van de Geer. Taking advantage of sparsity in multi-task learning. In 22nd Conference on Learning Theory (COLT), 2009.

[11] S. Negahban and M. J. Wainwright. Joint support recovery under high-dimensional scaling: Benefits and perils of $\ell_{1,\infty}$-regularization. In Advances in Neural Information Processing Systems (NIPS), 2008.

[12] S. Negahban and M. J. Wainwright. Estimation of (near) low-rank matrices with noise and high-dimensional scaling. In ICML, 2010.

[13] G. Obozinski, M. J. Wainwright, and M. I. Jordan. Support union recovery in high-dimensional multivariate regression. Annals of Statistics, 2010.

[14] P. Ravikumar, H. Liu, J. Lafferty, and L. Wasserman. Sparse additive models. Journal of the Royal Statistical Society, Series B.

[15] P. Ravikumar, M. J. Wainwright, and J. Lafferty. High-dimensional Ising model selection using $\ell_1$-regularized logistic regression. Annals of Statistics, 2009.

[16] B. Recht, M. Fazel, and P. A. Parrilo. Guaranteed minimum-rank solutions of linear matrix equations via nuclear norm minimization. In Allerton Conference, Allerton House, Illinois, 2007.

[17] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B, 58(1):267-288, 1996.

[18] J. A. Tropp, A. C. Gilbert, and M. J. Strauss. Algorithms for simultaneous sparse approximation. Signal Processing, Special issue on "Sparse approximations in signal and image processing", 86:572-602, 2006.

[19] B. Turlach, W. N. Venables, and S. J. Wright. Simultaneous variable selection. Technometrics, 27:349-363, 2005.

[20] M. van Breukelen, R. P. W. Duin, D. M. J. Tax, and J. E. den Hartog. Handwritten digit recognition by combined classifiers. Kybernetika, 34(4):381-386, 1998.

[21] M. J. Wainwright. Sharp thresholds for noisy and high-dimensional recovery of sparsity using $\ell_1$-constrained quadratic programming (lasso). IEEE Transactions on Information Theory, 55:2183-2202, 2009.

Appendix A 

Deterministic Necessary Optimality Conditions 

In this appendix, we investigate deterministic necessary 
conditions for the optimality of the solutions {B, S) of the 
problem 

A. Sub-differential of $\ell_1/\ell_\infty$ and $\ell_1/\ell_1$ Norms

In this section we state the sub-differential characterization 
of the norms we used in our convex program. The results can
be directly derived from the definition of sub-differential of a 
function. 

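Lemma 9 (Sub-differential of $\ell_1/\ell_\infty$-Norm). The matrix $Z \in \mathbb{R}^{p \times r}$ belongs to the sub-differential of the $\ell_1/\ell_\infty$-norm of the matrix $B$, denoted as $Z \in \partial \|B\|_{1,\infty}$, iff

(i) for all $j \in \mathrm{RowSupp}(B)$, we have

$$z_j^{(k)} = \begin{cases} t_j^{(k)} \, \mathrm{sign}\big(b_j^{(k)}\big) & k \in M_j(B) \\ 0 & \text{ow,} \end{cases}$$

where $t_j^{(k)} \ge 0$ and $\sum_{k=1}^{r} t_j^{(k)} = 1$.

(ii) for all $j \notin \mathrm{RowSupp}(B)$, we have $\sum_{k=1}^{r} \big|z_j^{(k)}\big| \le 1$.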
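As an illustrative numerical check of this characterization (not part of the original analysis), the following Python sketch builds one matrix $Z$ that satisfies conditions (i) and (ii) and then tests the subgradient inequality $\|B'\|_{1,\infty} \ge \|B\|_{1,\infty} + \langle Z, B' - B \rangle$ at randomly drawn points $B'$. The function names, the choice of equal weights on the entries of largest magnitude in each row, and the random test points are all assumptions made for this example.

```python
import numpy as np

def l1_linf_norm(B):
    """l1/l_inf norm: sum over rows of the largest absolute entry in the row."""
    return np.abs(B).max(axis=1).sum()

def one_subgradient(B):
    """Return one element Z of the sub-differential of the l1/l_inf norm at B.

    Nonzero row: equal nonnegative weights (summing to one) on the entries
    that attain the row maximum, multiplied by the sign of those entries.
    Zero row: the zero vector, whose l1 norm is trivially at most one.
    """
    Z = np.zeros_like(B, dtype=float)
    for j, row in enumerate(B):
        m = np.abs(row).max()
        if m > 0:
            mask = np.isclose(np.abs(row), m)
            Z[j, mask] = np.sign(row[mask]) / mask.sum()
    return Z

def satisfies_subgradient_inequality(B, Z, trials=1000, seed=0, tol=1e-9):
    """Check ||B'|| >= ||B|| + <Z, B' - B> at randomly perturbed points B'."""
    rng = np.random.default_rng(seed)
    base = l1_linf_norm(B)
    for _ in range(trials):
        B2 = B + rng.normal(size=B.shape)
        if l1_linf_norm(B2) < base + np.sum(Z * (B2 - B)) - tol:
            return False
    return True

B = np.array([[3.0, -3.0, 1.0],
              [0.0, 0.0, 0.0],
              [2.0, 0.5, -1.0]])
Z = one_subgradient(B)
print(Z)
print(satisfies_subgradient_inequality(B, Z))
```

A random search of this kind can only fail to find a violation; it does not prove membership in the sub-differential, so it should be read as a sanity check of the stated conditions rather than a verification of the lemma.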



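Lemma 10 (Sub-differential of $\ell_1/\ell_1$-Norm). The matrix $Z \in \mathbb{R}^{p \times r}$ belongs to the sub-differential of the $\ell_1/\ell_1$-norm of the matrix $S$, denoted as $Z \in \partial \|S\|_{1,1}$, iff

(i) for all $(j, k) \in \mathrm{Supp}(S)$, we have $z_j^{(k)} = \mathrm{sign}\big(s_j^{(k)}\big)$.

(ii) for all $(j, k) \notin \mathrm{Supp}(S)$, we have $\big|z_j^{(k)}\big| \le 1$.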



B. Necessary Conditions 

The first lemma shows a necessary condition for any solu- 
tion of the problem ([T]i. 

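Lemma 11. If $(S, B)$ is a solution (uniqueness is NOT required) of the problem, then the following properties hold:

(P1) $\mathrm{sign}\big(s_j^{(k)}\big) = \mathrm{sign}\big(b_j^{(k)}\big)$ for all $(j, k) \in \mathrm{Supp}(S)$ with $j \in \mathrm{RowSupp}(B)$.

(P2) if $\lambda_b/\lambda_s$ is not an integer, $D(S) \le \lfloor \lambda_b/\lambda_s \rfloor < \lceil \lambda_b/\lambda_s \rceil \le M(B)$.

(P3) $\big|b_j^{(k)}\big| = \|b_j\|_\infty$ for all $(j, k) \in \mathrm{Supp}(S)$.

(P4) if $\lambda_b/\lambda_s$ is not an integer, $\forall j \; \exists k$ such that $(j, k) \notin \mathrm{Supp}(S)$ and $\big|b_j^{(k)}\big| = \|b_j\|_\infty$.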
Proof: We provide the proof of each property separately.
(PI) Suppose there exists (jo,fco) G Supp(S'), such that 
sign(s^.''') = -sign(&^.''^). Let 5,5 G R^^'' be matrices 
equal to B^S in all entries except at {jo,ko). Consider 
the following two cases 



1) 



2) 



and 



< 



Let bi'"'^ 



0. Notice that (jo, ko) 



Supp(5). 



30 

-sigi 



JO 



> 



and s 



Let b 



(fco) 
jo 



liko) 



b[''°\ Notice that sig 

Jo 



Sign ( s)^^ ' 



and 



Since B + S = B + S and \\bjJ\oo < \\bj„\ 
Pjolli < Piolli' i'^ is ^ contradiction to the optimality 
of {13, S). 

(P2) We prove the result in two steps by establishing 1. 



M{B) > 



and 2. D{S) < 



1) In contrary, suppose there exists a row jo S 



RowSupp(S) such that \Mjg{B) 



\m,,{b) 


< 


At, 
As 


lement whose mag 



Let 



among the element of the 



is ranked 
vector bjg - 

to B, 5 in all entries except on the row jo and 



Sj„. Let B, S e W^^ he matrices equal 



and 



JO 



sign I b 



JO 



30 30 



30 



aiiu jjg — Sjg -r ujg uj„. ly^^i^^ Liiai 

M{B) > [^\ and sign (^J^j - sign (foj-f ) 



Notice that 
sigr 



sign 



4k) 
^30 



for all (jo,fc) G Supp(sj,-|) since sign 

for all (jo,fc) e Supp(^4o) 
(PI). Further, since S + B = 

116 



S 



B and 



^30 1 1 oo 
Sjolll - 



bf^ 

30 



+ 






*io 



and 



S Jo 111 



< 



bf'> 

30 



,ik' 
'30 



this 



is a contradiction to the optimality of {B, S) due 



to the fact that Ac 



< Ah 



(fe) 



and sign (4o^) ^ ^'S" (' 
Supp(sjo) since sign (s^-^^) 
(jo,fc) e Supp(%,). Since S + B = S + B and 



for all (jo,fc) e 
= sign ffejli^'' j for all 



16,0 I 



Jo 



'jo 



and 



r(fe- 



^ JO 111 



< 



«Jolll + 

this is 



a contradiction to the optimality of {B, S), due to 



the fact that A, 



< A, 



< Ah. 



(P3) If $j \notin \mathrm{RowSupp}(B)$ then the result is trivial.
Suppose there exists $(j_0, k_0) \in \mathrm{Supp}(S)$ with



dko) 
30 



< 



jo € RowSupp(S') such that 

Let B,S e W^"^ be matrices equal to B,S in all 
entries except for the entry corresponding to the 



index (jo,A:o). Let 



;(feo) , f(ko) 



> 



JO 

(feo) 



Sign (55;;") 



if 



l^jolloo and 6^^^ 



— /.C^o) I ~(ko) 
" "30 ^ *Jo 



Otherwise. Let s^-'""'' — s*-*^" 



B 



''Jo 111 



5 = 
< lis 



Jo Jo 

B + S and ||L- 



yiko) 

30 



l{ko) 
30 

ho 



Since 
and 



Jo 111' 



it is a contradiction to the optimality 



of {B,S). 



(P4) If $j \notin \mathrm{RowSupp}(B)$ or $j \notin \mathrm{RowSupp}(S)$ the result is
trivial. Suppose there exists a row $j_0 \in \mathrm{RowSupp}(B) \cap
\mathrm{RowSupp}(S)$ such that the result does not hold for that row.
Let k* = argmax^j^^^(^.^^s^pp(^jj bf^ . Let B,S £ 
]]jpxr matrices equal to B,S in all entries except for 
the row jo and 

"(Sjf) (jo,A:)eSupp(5) 

ow, 

. Since 13 + S = S + B 
and by (P2) and (P3), 



Jo 



JO 



Slgl 



and Sj„ = Sj„ + bj„ - b 



and 



'JO 111 



■^jo 

< IIS 



00 



Jo 111 



'Ik'} 



- 1 



^Jo 



jo 



this 



is a contradiction to the optimality of {B, S), due to the 



fact that A, 



- 1 < A. 



< Ah. 



This concludes the proof of the lemma. 



2) In contrary, suppose there exists a_ row jo G 

Let k* be 



'Jo Ho 



> 



RowSupp(S') such that 
the index of the element whose magnitude is ranked 



among the elements of the vector bj^ + Sj 



Let B, S E MP^^ be matrices respectively equal to 
13 and S in all entries except on the row jo and 





- s 


sign (fef^)) 


JO 


JO 





JO JO 



b) + s) 



and Sjg = Sja+bjo ^^jo- Notice that 13(5') < 



The next lemma shows why the assumption on the ratio of the
penalty regularizer parameters is crucial for our analysis. This
is not a deterministic result, but since it is related to the optimality
conditions, we include it in this appendix.

Lemma 12. If $(S, B)$ with $B \neq 0$ is a solution to the problem and
$d = \lambda_b/\lambda_s$ is an integer, then $(S, B)$ is not the unique solution.

Proof: To the contrary, assume that $(S, B)$ is the unique
solution. Take a nonzero row $b_{j_0}$ with $j_0 \in \mathrm{RowSupp}(B)$.

< d, then let B,S e Wp^'' be two matrices 



If 



equal to B, S except on the row jo and let bj^ — and 
•^jo = 6j„ +Sj„. Then, {B,S) are strictly better solutions than 
{13, S). This contradicts the optimality of {B,S). Hence, 
If



A{jg{B) > d. with similar argument we can conclude that 



< d. 



^ = d, then let < (5 < inm^j,^ k)esupp(s) 



4k) 
s- 

30 



and 



B{6),S{S) e W^"^ be two matrices equal to B, S except for 
the entries indexed {jo,k) G Supp(S') and let fo^^-* = Uj'^^^ + 
<5sign (bfj) and sfj = sfj - <5sign (sj^) for all (jo, k) e 

Supp(S'). Then, {B{d),S{S)) is another solution to ©• This 
contradicts the uniqueness of {B, S). 



If 

have 



< d, then using Lemma \TT\ and Equation |5] we 



[\M,„{B)\ >d+l 

r — d 
r — d 



, . . . , fc,+i eMjg(B) V« = 1, . . . , i + 1 



" JO JO ' 



El 



r — d 



3fci, 



, fei+i eAfj(,(B) VZ,m = 1, . . . ,i + 1 



In above equation Cki,k^ are some constants. The last conclu- 
sion follows from the fact that a['"''''s are continuous Gaussian 

]0 

variables and the cardinality of this event is less than the 

'M,,,{B)\=d. 

'"^ be two 



cardinality of the space they lie in. Hence, 



Let < (5 < 



and B{S),S{S) e 



matrices equal to B^S except for the entries indexed {jQ,k) 
for k e M,JB) and let bi'"^ = bi'"^ - 6 and s['"^ = ^ + S 

J"^ ' ^ Jo _ Jo__ 30 3a 

for all k e Mjg{B). Then, {B{5), S{S)) is another solution to 
([T]l. This contradicts the uniqueness of {B,S). 

■ 

The next lemma characterizes the optimal solution by
introducing a dual variable $Z$.

Lemma 13 (Convex Optimality). If $(B, S)$ is a solution of
the problem, then there exists a matrix $Z \in \mathbb{R}^{p \times r}$, called the dual variable,
such that $Z \in \lambda_s \partial \|S\|_{1,1}$ and $Z \in \lambda_b \partial \|B\|_{1,\infty}$ and for all
$k = 1, \ldots, r$,

" " (8) 

Proof: The proof follows from the standard first-order
optimality argument. ■
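To make the role of the dual variable concrete, the following Python sketch computes a candidate $Z$ from a pair $(B, S)$ and tests the two sub-differential memberships stated in the lemma. It rests on an assumption that this appendix does not restate: that the least-squares loss is scaled as $\frac{1}{2n}\sum_k \|y^{(k)} - X^{(k)}(b^{(k)} + s^{(k)})\|_2^2$, so that stationarity gives $z^{(k)} = \frac{1}{n} (X^{(k)})^T \big(y^{(k)} - X^{(k)}(b^{(k)} + s^{(k)})\big)$. All function names and the tolerance are hypothetical choices for illustration.

```python
import numpy as np

def dual_variable(X, Y, B, S):
    """Candidate dual variable under the ASSUMED 1/(2n) loss scaling.

    X is a list of r design matrices (n x p); Y is n x r; B and S are p x r.
    """
    n, r = Y.shape
    cols = [X[k].T @ (Y[:, k] - X[k] @ (B[:, k] + S[:, k])) / n for k in range(r)]
    return np.column_stack(cols)

def in_scaled_l1l1_subdiff(Z, S, lam_s, tol=1e-8):
    """Check Z in lam_s * (sub-differential of the l1/l1 norm at S)."""
    on = S != 0
    ok_on = np.allclose(Z[on], lam_s * np.sign(S[on]), atol=tol)
    ok_off = np.all(np.abs(Z[~on]) <= lam_s + tol)
    return bool(ok_on and ok_off)

def in_scaled_l1linf_subdiff(Z, B, lam_b, tol=1e-8):
    """Check Z in lam_b * (sub-differential of the l1/l_inf norm at B)."""
    for z, b in zip(Z, B):
        m = np.abs(b).max()
        if m == 0:
            # Zero row: only the l1 norm of the dual row is constrained.
            if np.abs(z).sum() > lam_b + tol:
                return False
            continue
        is_max = np.isclose(np.abs(b), m)
        if np.any(np.abs(z[~is_max]) > tol):               # zero off the max set
            return False
        if np.any(z[is_max] * np.sign(b[is_max]) < -tol):  # signs agree with b
            return False
        if abs(np.abs(z).sum() - lam_b) > tol:             # weights sum to lam_b
            return False
    return True

# Toy check: with X^T X / n = I and small correlations, (B, S) = (0, 0) is optimal.
n = 2
X = [np.sqrt(n) * np.eye(2), np.sqrt(n) * np.eye(2)]
A = np.array([[0.1, 0.05], [0.0, 0.0]])
Y = np.sqrt(n) * A
B0, S0 = np.zeros((2, 2)), np.zeros((2, 2))
Z = dual_variable(X, Y, B0, S0)
print(in_scaled_l1l1_subdiff(Z, S0, 0.2), in_scaled_l1linf_subdiff(Z, B0, 0.3))
```

If the loss in the main text uses a different scaling, the factor $1/n$ in `dual_variable` would need to change accordingly; the two membership tests depend only on Lemmas 9 and 10.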



Appendix B 
Coordinate Descent Algorithm 

We use the coordinate descent algorithm described as
follows. The algorithm takes the tuple $(X, Y, \lambda_s, \lambda_b, \epsilon, B, S)$
as input, and outputs $(\hat{B}, \hat{S})$. Note that $X$ and $Y$ are given
to this algorithm, while $B$ and $S$ are our initial guess or
the warm start of the regression matrices, and $\epsilon$ is the precision
parameter which determines the stopping criterion.

We update elements of the sparse matrix S using the 
subroutine UpdateS, and update elements in the block sparse 
matrix B using the subroutine UpdateB, respectively, until 
the regression matrices converge. The pseudocode is in
Algorithm 2 to Algorithm 4.



Algorithm 2 Our Model Solver 



Input: $X$, $Y$, $\lambda_b$, $\lambda_s$, $B$, $S$ and $\epsilon$
Output: $\hat{S}$ and $\hat{B}$



Initialization: 

for j = 1 : p do
for k = 1 : r do

^3 

for i = 1 : n do 



end for 
end for 
end for 



W ^(fe)\ 



/ 



Updating: 
loop 

$S \leftarrow$ UpdateS$(c; d; \lambda_s; B; S)$
$B \leftarrow$ UpdateB$(c; d; \lambda_b; B; S)$
if Relative Update $< \epsilon$ then

BREAK 
end if 
end loop 

RETURN $\hat{B} = B$, $\hat{S} = S$



A. Correctness of Algorithms 

In this algorithm, $B$ is the block sparse matrix and $S$ is
the sparse matrix. We alternately update $B$ and $S$ until
they converge. When updating $S$, we cycle through each
element of $S$ while holding all the other elements of $S$ and
$B$ unchanged; when updating $B$, we update each block $B_j$
(the coefficient vector of the $j$-th feature for the $r$ tasks) as a
whole, while keeping $S$ and the other coefficient vectors of $B$ fixed.

For updating B, the subproblem is updating Bj 



bj — argmm 



-±\ 

k=l 



+ Afc||6,||oo.(9) 



If we take the partial residual vector 



_ J2iis\"'Xl^>), the correctness 



(fc) Y{k)^ 



1^3 
Algorithm 3 UpdateB



Input: $c$, $d$, $\lambda_b$, $B$ and $S$
Output: B 

Update $B$ using the cyclic coordinate descent algorithm for
$\ell_1/\ell_\infty$ while keeping $S$ unchanged.



for j = 1 : p do
for k = 1 : r do



3 



if $\sum_{k=1}^{r} |\alpha_j^{(k)}| \le \lambda_b$ then

$b_j \leftarrow 0$

else



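Sort $\alpha$ to be $|\alpha_j^{(k_1)}| \ge |\alpha_j^{(k_2)}| \ge \cdots \ge |\alpha_j^{(k_r)}|$

$m^* = \arg\max_{1 \le m \le r} \big(\sum_{i=1}^{m} |\alpha_j^{(k_i)}| - \lambda_b\big)/m$
for i = 1 : r do

if $i > m^*$ then
$b_j^{(k_i)} \leftarrow \alpha_j^{(k_i)}$

else

$b_j^{(k_i)} \leftarrow \mathrm{sign}\big(\alpha_j^{(k_i)}\big) \big(\sum_{l=1}^{m^*} |\alpha_j^{(k_l)}| - \lambda_b\big)/m^*$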

end if 



end for 
end if 
end for 
end for 

RETURN B 
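For a single row, the sort-and-threshold rule in UpdateB coincides with the proximal operator of $\lambda_b \|\cdot\|_\infty$, that is, the minimizer of $\frac{1}{2}\|b - \alpha\|_2^2 + \lambda_b \|b\|_\infty$. The Python sketch below implements that row update and compares it against small random perturbations. It assumes that $\alpha$ is already the unpenalized coordinate-wise minimizer for the row (which holds, for instance, when the feature columns are scaled to a common norm); the function name `prox_linf` is hypothetical.

```python
import numpy as np

def prox_linf(alpha, lam):
    """Row update: argmin_b 0.5 * ||b - alpha||^2 + lam * max_k |b_k|.

    Mirrors the pseudocode: if the l1 norm of alpha is at most lam the row
    is set to zero; otherwise the m* largest magnitudes are clipped to the
    common level (sum of the top m* magnitudes - lam) / m*.
    """
    alpha = np.asarray(alpha, dtype=float)
    mag = np.abs(alpha)
    if mag.sum() <= lam:
        return np.zeros_like(alpha)
    srt = np.sort(mag)[::-1]                               # descending magnitudes
    levels = (np.cumsum(srt) - lam) / np.arange(1, alpha.size + 1)
    t = levels.max()                                       # level attained at m*
    # Entries above the level are clipped to it; the rest are left unchanged.
    return np.sign(alpha) * np.minimum(mag, t)

alpha = np.array([3.0, -3.0, 1.0])
b = prox_linf(alpha, 2.0)
print(b)  # the two largest entries are shrunk to a common magnitude

# Local optimality check against random perturbations of the result.
rng = np.random.default_rng(0)
f = lambda v: 0.5 * np.sum((v - alpha) ** 2) + 2.0 * np.abs(v).max()
print(all(f(b) <= f(b + 0.1 * rng.normal(size=3)) + 1e-12 for _ in range(200)))
```

Writing the update as a clip at level `t` avoids tracking the sorted index order explicitly: entries ranked after $m^*$ have magnitude no larger than `t`, so `np.minimum` leaves them untouched, as in the `if i > m*` branch of the pseudocode.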



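Algorithm 4 UpdateS

Input: $c$, $d$, $\lambda_s$, $B$ and $S$
Output: $S$

Update $S$ using the cyclic coordinate descent algorithm for
LASSO while keeping $B$ unchanged.

for j = 1 : p do
    for k = 1 : r do
        if $|\alpha_j^{(k)}| \le \lambda_s$ then
            $s_j^{(k)} \leftarrow 0$
        else
            $s_j^{(k)} \leftarrow \alpha_j^{(k)} - \lambda_s \, \mathrm{sign}\big(\alpha_j^{(k)}\big)$
        end if
    end for
end for

RETURN $S$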



of this algorithm will directly follow from the correctness of
the coordinate descent algorithm for $\ell_1/\ell_\infty$ regularization. With the same
argument, the correctness of Algorithm 4 can be proven.
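The alternating scheme described in this appendix can be sketched end to end in Python as follows. This is a minimal illustration under several assumptions that are not taken from the pseudocode: the objective is $\frac{1}{2n}\sum_k \|y^{(k)} - X^{(k)}(b^{(k)} + s^{(k)})\|_2^2 + \lambda_s \|S\|_{1,1} + \lambda_b \|B\|_{1,\infty}$, every feature column is scaled so that $\|X_j^{(k)}\|_2^2 = n$, the matrices start from zero rather than a warm start, and the "Relative Update" test is taken to be the $\ell_1$ change relative to the current $\ell_1$ size. All names are hypothetical.

```python
import numpy as np

def soft(a, lam):
    """Soft-thresholding, the entrywise update used for S."""
    return np.sign(a) * np.maximum(np.abs(a) - lam, 0.0)

def prox_linf(a, lam):
    """Row update used for B: prox of lam * ||.||_inf (sort-and-clip rule)."""
    mag = np.abs(a)
    if mag.sum() <= lam:
        return np.zeros_like(a)
    srt = np.sort(mag)[::-1]
    t = ((np.cumsum(srt) - lam) / np.arange(1, a.size + 1)).max()
    return np.sign(a) * np.minimum(mag, t)

def objective(X, Y, B, S, lam_s, lam_b):
    n, r = Y.shape
    fit = sum(np.sum((Y[:, k] - X[k] @ (B[:, k] + S[:, k])) ** 2) for k in range(r))
    return fit / (2 * n) + lam_s * np.abs(S).sum() + lam_b * np.abs(B).max(axis=1).sum()

def dirty_solver(X, Y, lam_s, lam_b, eps=1e-8, max_iter=500):
    """Alternate UpdateS and UpdateB sweeps; assumes ||X[k][:, j]||^2 == n."""
    n, r = Y.shape
    p = X[0].shape[1]
    B, S = np.zeros((p, r)), np.zeros((p, r))
    R = Y.astype(float).copy()               # residuals y - X (b + s), per task
    history = [objective(X, Y, B, S, lam_s, lam_b)]
    for _ in range(max_iter):
        B_old, S_old = B.copy(), S.copy()
        for j in range(p):                   # UpdateS: one entry at a time
            for k in range(r):
                xj = X[k][:, j]
                new = soft(S[j, k] + xj @ R[:, k] / n, lam_s)
                R[:, k] -= xj * (new - S[j, k])
                S[j, k] = new
        for j in range(p):                   # UpdateB: one row (block) at a time
            alpha = np.array([B[j, k] + X[k][:, j] @ R[:, k] / n for k in range(r)])
            new = prox_linf(alpha, lam_b)
            for k in range(r):
                R[:, k] -= X[k][:, j] * (new[k] - B[j, k])
            B[j] = new
        history.append(objective(X, Y, B, S, lam_s, lam_b))
        change = np.abs(B - B_old).sum() + np.abs(S - S_old).sum()
        if change <= eps * max(np.abs(B).sum() + np.abs(S).sum(), 1.0):
            break
    return B, S, history

# Small synthetic example: one shared row and one task-specific entry.
rng = np.random.default_rng(1)
n, p, r = 40, 6, 2
X = []
for _ in range(r):
    Xk = rng.normal(size=(n, p))
    Xk *= np.sqrt(n) / np.linalg.norm(Xk, axis=0)   # enforce ||column||^2 = n
    X.append(Xk)
theta = np.zeros((p, r))
theta[0] = [2.0, 2.0]
theta[1, 0] = -1.5
Y = np.column_stack([X[k] @ theta[:, k] for k in range(r)]) + 0.1 * rng.normal(size=(n, r))
B, S, hist = dirty_solver(X, Y, lam_s=0.1, lam_b=0.15)
print(np.round(B + S, 2))
```

Because each entry of $S$ and each row of $B$ is minimized exactly with everything else held fixed, the recorded objective values should be non-increasing from sweep to sweep, which gives a simple diagnostic when experimenting with the regularization parameters.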



